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Abstract 

Personalized medicine has received increasing attention among statisticians, com¬ 
puter scientists, and clinical practitioners. A major component of personalized medicine 
is the estimation of individualized treatment rules (ITRs). Recently, Zhao et al. (2012) 
proposed outcome weighted learning (OWL) to construct ITRs that directly optimize 
the clinical outcome. Although OWL opens the door to introducing machine learning 
techniques to optimal treatment regimes, it still has some problems in performance. 

(1) The estimated ITR of OWL is affected by a simple shift of the outcome. (2) The 
rule from OWL tries to keep treatment assignments that subjects actually received. 

(3) There is no variable selection mechanism with OWL. All of them weaken the finite 
sample performance of OWL. In this article, we propose a general framework, called 
Residual Weighted Learning (RWL), to alleviate these problems, and hence to improve 
finite sample performance. Unlike OWL which weights misclassification errors by clin¬ 
ical outcomes, RWL weights these errors by residuals of the outcome from a regression 
fit on clinical covariates excluding treatment assignment. We utilize the smoothed ramp 
loss function in RWL, and provide a difference of convex (d.c.) algorithm to solve the 
corresponding non-convex optimization problem. By estimating residuals with linear 
models or generalized linear models, RWL can effectively deal with different types of 
outcomes, such as continuous, binary and count outcomes. We also propose variable 
selection methods for linear and nonlinear rules, respectively, to further improve the 
performance. We show that the resulting estimator of the treatment rule is consistent. 

We further obtain a rate of convergence for the difference between the expected outcome 
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using the estimated ITR and that of the optimal treatment rule. The performance of 
the proposed RWL methods is illustrated in simulation studies and in an analysis of 
cystic fibrosis clinical trial data. 

Keywords: Optimal Treatment Regime, RKHS, Universal consistency, Convergence Rate, 
Residuals, Martingale residuals. 

1 Introduction 

Personalized medicine is a medical paradigm that utilizes individual patient information to 
optimize patients health care. Recently, personalized medicine has received much attention 
among statisticians, computer scientists, and clinical practitioners. The primary motivation 
is the well-established fact that patients often show significant heterogeneity in response 
to treatments. For instance, in a recent study, it was demonstrated that the optimal 
timing for the initiation of antiretroviral therapy (ART) varies in patients co-infected with 
human immunodeficiency virus and tuberculosis. Patients with CD4+ T-cell counts of less 
than 50 per cubic millimeter benefited substantially from earlier ART with a lower rate of 
new AIDS-defining illnesses and mortality as compared with later ART, while those with 
larger CD4+ T-cell counts did not have such a benefit (Havlir et al. 2011). The inherent 
heterogeneity across patients suggests a transition from the traditional “one size fits all” 
approach to modern personalized medicine. 

A major component of personalized medicine is the estimation of individualized treat¬ 
ment rules (ITRs). Formally, the goal is to seek a rule that assigns a treatment, from 
among a set of possible treatments, to a patient based on his or her clinical, prognostic or 
genomic characteristics. The individualized treatment rules are also called optimal treat¬ 
ment regimes. There is a significant literature on individualized treatment strategies based 
on data from clinical trials or observational studies (Murphy 2003; Robins 2004; Zhang 
et al. 2012; Zhao et al. 2009). Much of the work relies on modeling either the conditional 
mean outcomes or contrasts between mean outcomes. These methods obtain ITRs indi¬ 
rectly by inverting the regression estimates instead of directly optimizing the decision rule. 
For instance, Qian and Murphy (2011) applied a two-step procedure that first estimates a 
conditional mean for the outcome and then determines the treatment rule by comparing 
the conditional means across various treatments. The success of these indirect approaches 
highly depends on correct specification of posited models and on the precision of model 
estimates. 

In contrast, Zhao et al. (2012) proposed outcome weighted learning (OWL), using data 
from a randomized clinical trial, to construct an ITR that directly optimizes the clinical out¬ 
come. They cast the treatment selection problem as a weighted classification problem, and 
apply state-of-the-art support vector machines for implementation. This approach opens 
the door to introducing machine learning techniques into this area. However, there is still 
significant room for improved performance. First, the estimated ITR of OWL is affected 
by a simple shift of the outcome. This behavior makes the estimate of OWL unstable. Sec¬ 
ond, since OWL requires the outcome to be nonnegative, OWL works similarly to weighted 
classification, in which misclassification errors, the differences between the estimated and 
true treatment assignments, are targeted to be reduced. Hence the ITR estimated by OWL 
tries to keep treatment assignments that subjects actually received. This is not always ideal 
since treatments are randomly assigned in the trial, and the probability is slim that the 
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majority of subjects are assigned optimal treatments. Third, OWL does not have variable 
selection features. When there are many clinical covariates which are not related to the 
heterogeneous treatment effects, variable selection is critical to the performance. 

To alleviate these problems, we propose a new method, called Residual Weighted Learn¬ 
ing (RWL). Unlike OWL which weights misclassification errors by clinical outcomes, RWL 
weights these errors by residuals from a regression fit of the outcome. The predictors of 
the regression model include clinical covariates, but exclude treatment assignment. Thus 
the residuals better reflect the heterogeneity of treatment effects. Since some residuals are 
negative, the hinge loss function used in OWL is not appropriate. We instead utilize the 
smoothed ramp loss function, and provide a difference of convex (d.c.) algorithm to solve 
the corresponding non-convex optimization problem. The smoothed ramp loss resembles 
the ramp loss (Collobert et al. 2006), but it is smooth everywhere. It is well known that 
ramp loss related methods are shown to be robust to outliers (Wu and Liu 2007). The 
robustness to outliers for the smoothed ramp loss is helpful for RWL, especially when resid¬ 
uals are poorly estimated. Moreover, through using residuals, RWL is able to deal relatively 
easily with almost all types of outcomes. For example, RWL can work with generalized lin¬ 
ear models to construct ITRs for count/rate outcomes. We also propose variable selection 
approaches in RWL for linear and nonlinear rules, respectively, to further improve finite 
sample performance. 

The theoretical analysis of RWL focuses on two aspects, universal consistency and 
convergence rate. The notion of universal consistency is borrowed from machine learning. 
It requires for a learning method that when the sample size approaches infinity the method 
eventually learns the Bayes rule without knowing any specifics of the distribution of the 
data. We show that RWL with a universal kernel (e.g. Gaussian RBF kernel) is universally 
consistent. In machine learning, there is a famous “no-free-lunch theorem”, which states 
that the convergence rate of any particular learning rule may be arbitrarily slow (Devroye 
et al. 1996). In this article, we prove the “no-free-lunch theorem” for finding ITRs. Thus 
the rate of convergence studies for a particular rule must necessarily be accompanied by 
conditions on the distribution of the data. For RWL with Gaussian RBF kernel, we show 
that under the geometric noise condition (Steinwart and Scovel 2007) the convergence rate 
is as high as n -1//3 . 

At first glance, one may think that there is not a large difference between OWL and 
RWL except that RWL uses residuals as alternative outcomes. Actually, RWL enjoys many 
benefits from this simple modification. First, by using residuals of outcomes, RWL is able 
to reduce the variability introduced by the original outcomes. Second, since the numbers of 
subjects with positive and negative residuals are generally balanced, the ITR determined 
by RWL favors neither the treatment assignments that subjects actually received nor their 
opposites. Because of the above reasons, RWL improves finite sample performance. Third, 
RWL possesses location-scale invariance with respect to the original outcomes. Specifically, 
the estimated rule of RWL is invariant to a shift of the outcome; it is invariant to a scaling 
of the outcome with a positive number; the rule from RWL that maximizes the outcome is 
opposite to the rule that minimizes the outcome. These are intuitively sensible. 

The contributions of this article are summarized as follows. (1) We propose the general 
framework of Residual Weighted Learning to estimate individualized treatment rules. By 
estimating residuals with linear or generalized linear models, RWL can effectively deal with 
different types of outcomes, such as continuous, binary and count outcomes. For censored 
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survival outcomes, RWL could potentially utilize martingale residuals, although theoretical 
justification is still under development. (2) We develop variable selection techniques in 
RWL to further improve performance. (3) We present a comprehensive theoretical analysis 
of RWL on universal consistency and convergence rate. (4) As a by-product, we show 
the “no-free-lunch theorem” for ITRs that the convergence rate of any particular rule may 
be arbitrarily slow. This is a generic result, and it applies to any algorithm for optimal 
treatment regimes. 

The remainder of the article is organized as follows. In Section 2, we discuss Outcome 
Weighted Learning (OWL) and propose Residual Weighted Learning (RWL) to improve 
finite sample performance for continuous outcomes. Then we develop a general framework 
for RWL to handle other types of outcomes, and use binary and count/rate outcomes 
as examples. In Section 3, we establish consistency and convergence rate results for the 
estimated rules. The variable selection techniques for RWL are discussed in Section 4. 
We present simulation studies to evaluate performance of the proposed methods in Section 
5. The method is then illustrated on the EPIC cystic fibrosis randomized clinical trial 
(Treggiari et al. 2009, 2011) in Section 6. We conclude the article with a discussion in 
Section 7. All proofs are given in Appendix. 

2 Methodology 

2.1 Outcome Weighted Learning 

Consider a two-arm randomized trial. We observe a triplet ( X,A,R ) from each patient, 
where X = (X\, ■ ■ ■ ,X p ) T € X denotes the patient’s clinical covariates, A £ A = {1, —1} 
denotes the treatment assignment, and R is the observed clinical outcome, also called the 
“reward” in the literature on reinforcement learning. We assume that R is bounded, and 
larger values of R are more desirable. An individualized treatment rule (ITR) is a function 
from X to A. Let n(a,x) := P(A = a\X = x) be the probability of being assigned 
treatment a for patients with clinical covariates x. It is predefined in the trial design. 
Here we consider a general situation. In most clinical trials, the treatment assignment is 
independent of X , but in some designs, such as stratified designs, A may depend on X. 
We assume n (a, x) > 0 for all a € A and x € X. 

An optimal ITR is a rule that maximizes the expected outcome under this rule. Math¬ 
ematically, the expected outcome under any ITR d is given as 

E(R\A = d(X)) = E = d(X))) , (1) 

where I(-) is the indicator function. Interested readers may refer to Qian and Murphy 
(2011) and Zhao et al. (2012) regarding the derivation of (1). This expectation is called the 
value function associated with the rule d, and is denoted V(d). In other words, the value 
function V(d) is the expected value of R given that the rule d(X) is applied to the given 
population of patients. An optimal ITR d* is a rule that maximizes V(d). Finding d* is 
equivalent to the following minimization problem: 

d*£argmm E (^^]l(.4^(X))). (2) 
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Zhao et al. (2012) viewed this as a weighted classification problem, in which one wants to 
classify A using X but also weights each misclassification error by R/n(A, X). 

Assume that the observed data {(a;*, a*, 7-j) : * = 1, ■ ■ ■ , n} are collected independently. 
For any decision function f(x), let dj(x) = sign(f(x )) be the associated rule, where 
sign(u ) = 1 for u > 0 and —1 otherwise. The particular choice of the value of sign( 0) is 
not important. In this article, we fix sign{ 0) = —1. Using the observed data, the weighted 
classification error in (2) can be approximated by the empirical risk 



i =1 


n 


I 


Xi 




(3) 


The optimal decision function f(x) is the one that minimizes the outcome weighted error 
(3). 


It is well known that empirical risk minimization for a classification problem with the 0-1 
loss function is an NP-hard problem. To alleviate this difficulty, one often finds a surrogate 
loss to replace the 0-1 loss. Outcome weighted learning (OWL) proposed by Zhao et al. 
(2012) uses the hinge loss function, and also applies the regularization technique used in 
the support vector machine (SVM) (Vapnik 1998). In other words, instead of minimizing 
(3), OWL aims to minimize 


1 n r- 

n “ 7T (a,i,Xi) 


1 - a,if(xi )) 


+ AH/II 2 , 


(4) 


where (u)+ = max(u, 0) is the positive part of u, ||/|| is some norm for /, and A is a tuning 
parameter controlling the trade-off between empirical risk and complexity of the decision 
function /. 

OWL opens the door to the application of statistical learning techniques to personalized 
medicine. However, this approach is not perfect. First, a simple shift on the outcome R 
should not affect the optimal decision rule, as seen from (1) and (2). That is, d* in (2) does 
not change if R is replaced by R + c, for any constant c. Unfortunately this nice invariance 
property does not hold for the decision function f(x) of OWL in (4) when r t is replaced 
by 7‘i + c. Consider an extreme case where c is very large and 7r(l,*) = 7r(—1,®) = 0.5 
for all x € X, the weights (rj + c)/n(ai, Xi) are almost identical, and the weighted problem 
is approximately transformed to an unweighted one. Hence the performance of OWL can 
be affected by a simple shift of R. Second, OWL further assumes that R is nonnegative 
to gain computational efficiency from convex programming. A direct consequence of this 
assumption, as seen from (3), is that the treatment regime df{xj) tends to match the 
treatment Oj that was actually assigned to the patient, especially when the decision function 
/ is chosen from a rich class of functions. This property is not ideal for data from a 
randomized clinical trial, since treatments are actually randomly assigned to patients. 


2.2 Residual Weighted Learning 

In this section, we only consider the case that the outcome R is continuous, and extend the 
framework to other types of outcomes in Section 2.4. 

As demonstrated previously, the decision rule in (2) is invariant to a shift of outcome 
R by any constant. Moreover, Lemma 2.1 shows that d* in (2) is invariant under a shift 
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of R by a function of X. That is, d* remains unchanged if R is replaced by R — g(X) 
for any function g, as long as g is not related to d. So there is an optimal solution / 
in (4) corresponding to a particular function g , when 7 / is replaced by 7 / — g{xf), and the 
associated rule d/ = sign(f) can be seen as an estimated optimal rule in (2). As g varies, we 
obtain a collection of d/’s. When the sample size is very large, these d/’s perform similarly 
by the infinite sample property of OWL shown in Zhao et al. (2012). However, when the 
sample size is limited, as in clinical settings, the choice of g is critical to the performance 
of OWL. Zhao et al. (2012) did not delve deeply into this problem. In this article, we will 
provide a solution to this, and improve finite sample performance. 

Lemma 2.1. For any measurable g : X —>• R and any probability distribution for (X, A, R), 

E (wr'i 4 * - E (to ,(a * d(x>) ) ~ E MX>) ■ 


Intuitively, the function g with the smallest variance of — 7 ^ d(X)) is a 

good choice. However, such a function depends on the decision rule d, as shown in the 
following theorem. The proof of the theorem is provided in Appendix. 

R 


Theorem 2.2. Among all measurable g : X —>• 
is the function g that minimizes the variance of 


R, g(X) = E 
R-g(X) 


ir(A,X) 
l(A^d(X)) 


l(A?d(X)) 


X 


vr (A,X) 

Our purpose is to find a function g, which is not related to d, to reduce the variance of 

R-g(x), 


1 t(A,X 
be written as, 


I (A 7 ^ d(X)) as much as possible. As shown in the proof, the minimizer g can 


g(X) = E(R\X,A = l)l(d(X) / 1) +E(R\X,A = -l)l(d(X) / -l). 


That is, g(X) jumps between E(i?|X, A = 1) and E(1?|X, A = —1) as d(X) varies. Hence 
when d is unknown, a reasonable choice of g is 

E(R\X,A = 1)+E(R\X,A = -1) w ( R 

3(x) = -2- = e (mIw) 

We propose to minimize the following empirical risk, rather than the original one in (3): 



jl r i ~ £l*( x i) j 

n n(ai,Xi) 


xf) 7 ^ Cli'j 1 


( 6 ) 


where g* is an estimate of g*. For simplicity, let fj = r j — g*(xi) be the estimated residual. 
Here, we do not weight misclassihcation errors by clinical outcomes as OWL does, and 
instead we weight them by residuals from a regression fit of outcomes. Thus the optimal 
decision function is invariant under any translation of clinical outcomes. We call the method 
Residual Weighted Learning (RWL). 

RWL also has the following interpretation. The outcome R can be characterized as 


R = g(X) + 5{X)-A + e, 
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X X 


Figure 1: Example of residual weighted learning. The raw data are shown on the left, 
consisting of a single covariate X, treatment assignment A = 1 or — 1, and continuous 
outcome R with E(R\X, A) = 3X + XA. The allocation ratio is 1:1. The residuals, 
R — 3X, are shown on the right. 


where e is the mean zero random error. g{X) reflects common effects of clinical covariates 
X for both treatment arms. It is easy to see that n{X) = (E(i?|X,yl = 1) + E(R|X,A = 
—1))/2, so the residual, R — g*(X), in RWL is just <5(X) • A + e, which captures all sources 
of heterogeneous treatment effects. The rationale of optimal treatment regimes is to keep 
treatment assignments that subjects have actually received if those subjects are observed 
to have large outcomes, and to switch assignments if outcomes are small. However, the 
largeness and smallness for outcomes are relative. A rather large outcome may still be con¬ 
sidered as small when compared with subjects having similar clinical covariates, as shown 
in Figure 1. Outcomes are not comparable among subjects with different clinical covariates, 
while the residual, by removing common covariates effects, is a better measurement. Fig¬ 
ure 1 illustrates how residuals work using an example with a single covariate X. The raw 
data are shown on the left, and residuals are shown on the right. Residuals are comparable 
among subjects, and larger residuals represent better outcomes. 
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Another benefit from using residuals is a clear cut-off, i.e. 0, to distinguish between 
subjects with good and poor clinical outcomes. To minimize the empirical risk in (6), for 
subjects with positive residuals, RWL is apt to recommend the same treatment assignments 
that subjects have actually received; for subjects with negative residuals, RWL is more likely 
to give the opposite treatment assignments to what they have received. The optimal ITR is 
the one maximizing the conditional expected outcome given in (1). An empirical estimator 
of (1) is, 


1 

n 


E 


n 

7 5 ) 



(7) 


Though it is unbiased, it may give an estimate outside the range of R. While for finite 
samples a better estimator of (1) as shown in Murphy (2005) is, 


1 ri'lyd(xi') — CLij 

n 7r 

1 sr^n I (rf(a?i)— a i) 

Tl £-^i =1 7T (CLi,Xi) 


The denominator is an estimator of E 


Ip=d(X)) ^ 
AAX) )■ 


( 8 ) 


Similar reasoning as that used in 


the proof of Lemma 2.1 yields E 


l(A=d(X)) ) 
n(A,X) 


= 1. The estimator 


1 y-m di'j 

n 2-*i= 1 7T (a,i,Xi) 


IS 


called the treatment matching factor in this article. The factor varies between 0 and 2. 
If a rule favors treatment assignments that subjects have actually received, its treatment 
matching factor is greater than 1, while in contrast if a rule prefers the opposite treatment 
assignments, its treatment matching factor is less than 1. For a randomized clinical trial, 
we expect that the estimated rule is associated with a treatment matching factor close to 1. 
For OWL, since all the weights are nonnegative, the estimated rule of OWL tends to keep, 
if possible, the treatments that subjects actually received. Thus the associated treatment 
matching factor would be greater than 1, especially when the sample size is small, or when 
a complicated rule is applied. Hence the estimator in (8) might not be large even though 
(7) is maximized. RWL alleviates this problem by using residuals to make an initial guess 
on the optimal rule. Owing to the balance between subjects with positive and negative 
residuals, RWL implicitly finds a rule with its treatment matching factor close to 1. 

There are many ways to estimate g*. In this article, we consider two models. The 


first one is the main effects model. Assume that E 


R 


X ) = /3q + X r (3, where 


P = {Pir- ,Pp ) T 

weighted squares, 


2it {A,X) 

The estimates P and /3q can be obtained by minimizing the sum of 


1 


“ 2vr (ai,Xi) 


in- Po-xfpf 


It can be solved easily by almost any statistical software. Then the estimate is g*(x) = 


R 


X = Pq. It is easy 


Pq + x T 0. The second is the null model. Assume that E . 

\27r(A, X) 

to obtain g*{x) = p 0 = ELi ■ 

In summary, we propose a method called Residual Weighted Learning to identify the 
optimal ITR by minimizing the residual weighted classification error (6). The impact of 
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using residuals is two-fold: it stabilizes the variance of the value function and controls the 
treatment matching factor. Both improve finite sample performance. 


2.3 Implementation of RWL 


As in OWL, we intend to use a surrogate loss function to replace the 0-1 loss in (6). Since 
some residuals are negative, convex surrogate functions do not work here. We consider a 
non-convex loss 


T(u) 


' 0 

(1 -n? 

2 — (1 + u) 2 
2 


if u > 1, 
if 0 < u < 1, 
if — 1 < u < 0, 
if u < —1. 


It is called the smoothed ramp loss in this article. Figure 2 shows the hinge loss, ramp 
loss and smoothed ramp loss functions. The hinge loss is the loss function used in support 
vector machines (Vapnik 1998). The ramp loss (Collobert et al. 2006) is also called the 
truncated hinge loss function (Wu and Liu 2007). It is well known that by truncating the 
unbounded hinge loss, ramp loss related methods are shown to be robust to outliers in 
the training data for the classification problem (Wu and Liu 2007). Compared with the 
ramp loss, the smoothed ramp loss is smooth everywhere. Hence it has computational 
advantages in optimization. Moreover, the smoothed ramp loss, which resembles the ramp 
loss, is robust to outliers too. In the framework of RWL, subjects who did not receive 
optimal treatment assignments but had large positive residuals, or those who did receive 
optimal assignments but had large negative residuals may be considered as outliers. The 
robustness to outliers for the smoothed ramp loss is helpful in dealing with outliers in the 
setting of optimal treatment regimes, especially when residuals are poorly estimated, or 
when the outcome R has a large variance. 

We incorporate RWL into the regularization framework, and aim to minimize 


1 

n 


E 


—- T{ ai f(xi )) + ^ 
ir{ai,Xi) 2 


(9) 


where ||/|| is some norm for /, and A is a tuning parameter. Recall that r, is the estimated 
residual. The smoothed ramp loss function T{u) is symmetric about the point (0,1) as 
shown in Figure 2(c). A nice property that comes from the symmetry is that the rule that 
minimizes the outcome R ( i.e. maximizes —R) is just opposite to the rule that maximizes 
R. This is intuitively sensible. However, OWL does not possess this property. 

We derive an algorithm for linear RWL in Section 2.3.1, and then generalize it to the 
case of nonlinear learning through kernel mapping in Section 2.3.2. 


2.3.1 Linear Decision Rule for Optimal ITR 

Suppose that the decision function f(x) that minimizes (9) is a linear function of x , i.e. 
f(x) = w T x + b. Then the associated ITR will assign a subject with clinical covariates x 
into treatment 1 if w T x + b > 0 and —1 otherwise. In (9), we define ||/|| as the Euclidean 
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u u 

(a) Hinge loss (b) Ramp loss 




U U 

(c) Smoothed ramp loss (d) Decomposition 

Figure 2: Hinge loss (a), ramp loss (b), and smoothed ramp loss (c) functions, (d) shows 
the difference of convex decomposition of the smoothed ramp loss, T(u) = (j>i(u) — 4>o(u). 
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norm of w. Then minimizing (9) can be rewritten as 


min 

w,b 


w T w + 




n ^ ir(ai,Xi) 


T(ai(w T Xi + b )). 


( 10 ) 


The smoothed ramp loss is a non-convex loss, and as a result, the optimization problem 
in (10) involves non-convex minimization. This optimization problem is difficult since 
there are many local minima or stationary points. For instance, any ( w,b ) with w = 0 
and |b| > 1 is a stationary point. Similar with the robust truncated hinge loss support 
vector machine (Wu and Liu 2007), we apply the d.c. (Difference of Convex) algorithm 
(An and Tao 1997) to solve this non-convex minimization problem. The d.c. algorithm is 
also known as the Concave-Convex Procedure (CCCP) in the machine learning community 
(Yuille and Rangarajan 2003). Assume that an objective function can be rewritten as the 
sum of a convex part Q vex { ©) and a concave part Qcav(®)- The d.c. algorithm as shown 
in Algorithm 1 solves the non-convex optimization problem by minimizing a sequence of 
convex subproblems. One can easily see that the d.c. algorithm is a special case of the 
Majorize-Minimization (MM) algorithm. 


Algorithm 1: The d.c. algorithm for minimizing Q(Q ) = Q ve x{ ©) + Qcavi ©)• 

Initialize O^ 0 ^ 

repeat 

©h+L = argmirig, Q vex (0) + < Q' cav { O W ),0 > 
until convergence of 0^ 


Let 


(f>s{u) 


0 if u > s 

(s — u ) 2 if s — l<u<s 
2s — 2u — 1 if u < s — 1 


Note that <f s is smooth. We have a difference-of-convex decomposition of the smoothed 
ramp loss, 

T{u) = 4>i(u) ~ M u )i (H) 


as shown in Figure 2(d). Denote 0 as ( w,b ). Applying (11), the objective function in (10) 
can be decomposed as 


X 1 n 

Q s {©) = -W J W + - Y] > 0 ) + (j)o(ui)I(fi < 0 ) 

2 n z —^ L 


2—1 


- ^(^2 5 3?2 ) 


Qvex(S) 


1 U 

■-\(t>i{ui)I(ri < 0 ) + cj)o{ui)I(ri > 0 ) 

n z ' L 
2—1 


- ^"(^2 5 3Ci) 


Qcav (©) 
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where Ui = cn(w T Xi + b). For simplicity, we introduce the notation, 


Pi = 


dQc 


dm 


1 < 0) + #oW I( .. > 0) 


n 


dui 


dui 


- ^(^-2 5 *^ i ) 


( 12 ) 


for i = 1, • • • , n. Thus the convex subproblem at the (t+l)’th iteration of the d.c. algorithm 
is 


min 

w,b 


A T 1 
—w w H— 
2 n 


n 

^ > 0) + (^o(uj)I(fi < 0) 

2—1 





(13) 


There are many efficient methods available for solving smooth unconstrained optimization 
problems. In this article, we use the limited-memory Broyden-Fletcher-Goldfarb-Shanno 
(L-BFGS) algorithm (Nocedal 1980). L-BFGS is a quasi-Newton method that approxi¬ 
mates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of 
computer memory. For more information on the quasi-Newton method and L-BFGS, see 
Nocedal and Wright (2006). 

The procedure is summarized in the following algorithm: 


Algorithm 2: The d.c. algorithm for linear RWL with the smoothed ramp loss 
Set e to a small quantity, say, 10~ 8 
Initialize /3j 0) = —— T I(fj < 0) 

repeat 

• Compute ( w,b ) by solving (13), 

• Update Ui = a,i(w T Xi + b) 

• Update y +1) by (12) 
until ||/3^ +1 ) — /3W||oo < 6 


2.3.2 Nonlinear Decision rule for Optimal ITR 


For a nonlinear decision rule, the decision function f(x) is represented by h(x) + b with 
h(x) € Hk and b € M, where Hk is a reproducing kernel Hilbert space (RKHS) associated 
with a Mercer kernel function K. The kernel function K(-, ■) is a positive definite function 
mapping from X x X to M. The norm in 'Hk, denoted by || • ||a'j is induced by the following 
inner product: 

n m 

< f, g >k= yy C i i p j K(xi,x j ), 
i =1 j =1 


for /(•) = Er=i 
ten as 


ctiK (•, Xi ) and g(-) = Xj ). Then minimizing (9) can be rewrit- 


min 

h,b 




i 2 -t- 
IK + 


n 

^^j T ( a ‘( h ^ + b ))' 


(14) 
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Due to the representer theorem (Kimeldorf and Wahba 1971), the nonlinear problem can 
be reduced to finding finite-dimensional coefficients Vi, and h(x) can be represented as 
'^2'j = i v jK(x, x j). Thus the problem (14) is transformed to 

\ n l n v- n 

mi , n nY^v i VjK(x i ,x j ) + -^2—^—T(a i (^2v j K(x i ,x j ) + b)). (15) 

v , b z , . =i n , =i Tr\a%,Xi) j=1 

Following a similar derivation to that used in the previous section, the convex subproblem 
at the (t + l)’th iteration of the d.c. algorithm is as follows, 


min 

v,b 


^ it -j it 

2 2 v i v 3 K ( x i’ x j) + -J2 

i,j=l i =1 


^i(ui)I(fj > O) + 0 o (ui)I(fj < 0) 


J 7 T (ai,Xi) + 


(*) 


Ui 


where rq = a* (^” =1 VjK(xi, Xj) + b ). After solving the subproblem by L-BFGS, we update 
/ 0( t+1 ) by (12). The procedure is repeated until (3 converges. When we obtain the solution 
(v,b), the decision function is f(x) = ^)” =1 VjK(x,Xj) + b. Note that if we choose a 
linear kernel K(x,z) = x T z , the obtained rule reduces to the previous linear rule. The 
most widely used nonlinear kernel in practice is the Gaussian Radial Basis Function (RBF) 
kernel, that is, 


K a {x,z) 


exp 



where a > 0 is a free parameter whose inverse 1/a is called the width of K a . 


2.4 A general framework for Residual Weighted Learning 

We have proposed a method to identify the optimal ITR for continuous outcomes. However, 
in clinical practice, the endpoint outcome could also be binary, count, or rate. In this 
section, we provide a general framework to deal with all these types of outcomes. 

The procedure is described as follows. First, estimate the conditional expected outcomes 
given clinical covariates X , E (^ 07T (a x) |-^~) > by an appropriate regression model. It is 
equivalent to fitting a weighted regression model, in which each subject is weighted by 
2 tt(A x ) ' Second, we calculate the estimated residual f t by comparing the observed outcome 
and expected outcome estimated in the first step. If the observed outcome is better than 
the expected one, the estimated residual fj is positive, otherwise it is negative. Specifically, 
if larger values of R are more desirable, as with our assumption for continuous outcomes, 
ri = rj — E ( ^(AX) \ ^ = **) i ^ smaller values of R are preferred, e.g. the number of 

adverse events, then fi = E ^ o n (A,x) = **) — r i- Third, identify the decision function 
f(x) by using the estimated fj in RWL. 

The underlying idea is simple. Generally, the outcome R is a random variable depend¬ 
ing on X and A. We estimate common effects of X by ignoring the treatment assignment 
A, and then all information on the heterogeneous treatment effects is contained in the 
residuals. As demonstrated previously, the benefits from using residuals include stabilizing 
the variance and controlling the treatment matching factor, both of which have a positive 
impact on finite sample performance. In previous sections, we discussed RWL for continu¬ 
ous outcomes in detail. We will provide two additional examples to illustrate the general 
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framework. 

The first example is for binary outcomes, i.e. R G {0,1}. Assume that R = 1 is 
desirable. We may fit a weighted main effects logistic regression model, 


E(R|AT) 


exp (A) + X T (3) 

1 + exp(/3 0 + X T (3) 


The estimates /3o and (3 can be obtained numerically by statistical software, for example, 
the glm function in R. The residual fj is r*- exp($o+x $) . qq ien we estimate the optimal 

l+exp(/3o+X J p) 

ITR by RWL. 

The second is for count/rate outcomes. We use the pulmonary exacerbation (PE) 
outcome of the cystic fibrosis data in Section 6 as an example. The outcome is the number of 
PEs during the study, and it is a rate variable. The observed data are D n = {xi , a*, r l ,t l }2 =l , 
where r ? ; is the number of PEs in the duration t,. We treat the count r\ as the outcome, 
and ( Xi,ti ) as clinical covariates. We may fit a weighted main effects Poisson regression 
model, 

log (E(R\X, T)) = (3 0 + X t (3 + log(T). 

Since fewer PEs are desirable, we compute r* as exp ^/3 0 + xj (3 + log(L)) — J"j, which is the 
opposite of the raw residual in generalized linear models. 


3 Theoretical Properties 


In this section, we establish theoretical results for RWL. To the end, we assume that a 
sample D n = {xi,a,i,ri}™ =1 , is independently drawn from a probability measure P on 
X x A x Ad, where X C M p is compact, and Ad = [— M, M] C M. For any ITR d : X —>■ A, 
the risk is defined as 


R(d) = E 


R 

<A,X) 


I(A / d(X)) 


The ITR that minimizes the risk is the Bayes rule d* = argmin^T Z(d), and the correspond¬ 
ing risk 1Z* = lZ(d*) is the Bayes risk. Recall that the Bayes rule is d*(x) = sign(E(R\X = 
x,A= 1) -E(i?|X = x,A = -1)). 

In RWL, we substitute the 0-1 loss I(A ^ sign(f(X ))) by the smoothed ramp loss, 
T(Af(X)). Accordingly, we define the T-risk as 


XtM) = E 


' R-g(x) 
. AAX) 


T{Af{X )) 


and, similarly, the minimal T-risk as 7 Zj, = inf^ TZT,g(f) and fx g = argminy R,T,g(f)- 
In the theoretical analysis, we do not require g to be a regression fit of R. g can be 
any arbitrary measurable function. We may further assume that (R — g(X))/iT{A,X) is 
bounded almost surely. 

The performance of an ITR associated with a real-valued function / is measured by the 
excess risk lZ(sign(f)) — 1Z*. In terms of the value function, we note that the excess risk is 
just V(d*) — V(sign(f)). Similarly, we define the excess T-risk as TZx,g(f) — TZ? g - 

Let f Dn:Xn € Rk + {±}, i.e. f Dn ,\ n = h Dn ,\ n + b Dn ,\ n where h Dni \ n € Hr and b Dn}Xn € 
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R, be a global minimizer of the following optimization problem: 


min 

/=/i+fee%K+{1} 


i 

2—1 


ri-g{xj) T 
^{cLi , 3 ?^) 



Here we suppress g from the notation of f Dn ,\ n , h Dn ,\ n and b Dnt \ n . 

The purpose of the theoretical analysis is to estimate the excess risk TZ(sign(fD n ,\ n )) — 
1Z* as the sample size n tends to infinity. Convergence rates will be derived under the choice 
of the parameter X n and conditions on the distribution P. 


3.1 Fisher Consistency 

We establish Fisher consistency of the decision function based on the smoothed ramp loss. 
Specifically, the following result holds: 

Theorem 3.1. For any measurable function f, if ff g minimizes T-risk lZx,g{f), then the 
Bayes rule d*(x ) = sign(ff g {x)) for all x such that E(T|W = x, A = 1) 7 ^ E(I?|W = 
x,A = —1). Furthermore, 1ZT,g(d*) = 'R-fg- 

The proof is provided in Appendix. This theorem shows the validity of using the 
smoothed ramp loss as a surrogate loss in RWL. 


3.2 Excess Risk 

The following result establishes the relationship between the excess risk and excess T-risk. 
The proof can be found in Appendix. 

Theorem 3.2. For any measurable f : X —» R and any probability distribution for 
C X,A,R), 

lZ(sign(f)) - K* < n T ,g(f) - TV Tg . 

This theorem shows that for any decision function /, the excess risk of / under the 0-1 
loss is no larger than the excess risk of / under the smoothed ramp loss. It implies that if 
we obtain / by approximately minimizing TZT,g(f) so that the excess T-risk is small, then 
the risk of / is close to the Bayes risk. In the next two sections, we investigate properties 
of fD n ,x n through the excess T-risk. 

3.3 Universal Consistency 

In the literature of machine learning, a classification rule is called universally consistent if 
its expected error probability converges in probability to the Bayes error probability for 
all distributions underlying the data (Devroye et al. 1996). We extend this concept to 
ITRs. Precisely, given a sequence D n of training data, an ITR d n is said to be universally 
consistent if 

lim K(dn) = 'll* 

n—>■ 00 

holds in probability for all probability measures P on X x A x A4. In this section, we will 
establish universal consistency of the rule = sign{fD n ,\ n )- Before proceeding to the 

main result, we introduce some preliminary background on RKHSs, which is essential for 
our work. 
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Let K be a kernel, and Hk be its associated RKHS. We often use the quantity 


Ck = sup y/K(x, x). 

In this article, we assume Ck is finite. For the Gaussian RBF kernel, Ck = 1. The 
assumption also holds for the linear kernel when X C M p is compact. By the reproducing 
property, it is easy to see that ||/||oo < Cft:||/||iO for any / £ Hk- A continuous kernel 
If on a compact metric space X is called universal if its associated RKHS Hk is dense 
in C(X), i.e., for every function g £ C(X) and all e > 0 there exists an / £ 'Hk such 
that ||/ — fliloo < e (Steinwart and Christmann 2008, Definition 4.52). C(X) is the space 
of all continuous functions / : X —>• R on the compact metric space X endowed with the 
usual supremum norm. The widely used Gaussian RBF kernel is an example of a universal 
kernel. 

We are ready to present the main result of this section. The following theorem shows 
the convergence of the T-risk on the sample dependent function /#\ n . We apply empirical 
process techniques to show consistency. The proof is provided in Appendix. 

Theorem 3.3. Assume that we choose a sequence X n > 0 such that X n —y 0 and n\ n —>• oo. 
For a measurable function g, assume that almost surely. Then for any 

distribution P for (X, A, R), we have that in probability, 

I™ KtMd-k) = f Jn! +m K T , s (f). 

By Theorem 3.1, the Bayes rule d* minimizes the T-risk 7Zx,g(f)- Universal consis¬ 
tency follows if inff & 'H K +{i}'R'T,g(f) = 'R-t y by Theorem 3.2. That is, d* needs to be 
approximated by functions in the space Hk + {1}- Clearly, this is not always true. A 
counterexample is when K is linear. 

Let g be the marginal distribution of X. Clearly, the rule d* is measurable with respect 
to g. Recall that a probability measure is regular if it is defined on the Borel sets. By 
Lusin’s theorem, a measurable function can be approximated by a continuous function 
when g is regular (Rudin 1987, Theorem 2.24). Thus the RKHS Hk of a universal kernel 
K is rich enough to provide arbitrarily accurate decision rules for all distributions. The 
finding is summarized in the following universal approximation lemma. The rigorous proof 
is provided in Appendix. 

Lemma 3.4. Let K be a universal kernel, and Hk be the associated RKHS. For all distri¬ 
butions P with regular marginal distribution g on X, we have 


inf 

f£H K +{ 1} 


KtM = n 


T,g- 


This immediately leads to universal consistency of RWL with a universal kernel. 

Proposition 3.5. Let K be a universal kernel, and Hk be the associated RKHS. Assume 
that we choose a sequence X n > 0 such that X n —>• 0 and nX n —>• oo. For a measurable 
function g, assume that ^ 0 a ^ mos ^ surely. Then for any distribution P for 

( X,A,R ) with regular marginal distribution on X, we have that in probability, 


lim 7Z(sign(f Dnt A J) = H*. 
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3.4 Convergence Rate 

A consistent rule guarantees that by increasing the amount of data the rule can eventually 
learn the optimal decision with high probability. The next natural question is whether there 
are rules d n with lZ(d n ) tending to the Bayes risk 77* at a specified rate for all distributions. 
Unfortunately, this is impossible as shown in the following theorem. The proof follows the 
arguments in Devroye et al. (1996, Theorem 7.2), which states in the classification problem 
the rate of convergence of any particular rule to the Bayes risk may be arbitrarily slow. We 
provide the outline of the proof in Appendix. 

Theorem 3.6. Assume there are infinite distinct points in X . Let {c n } he a sequence of 
positive numbers converging to zero with > Cl > C2 > ... . For every sequence d n of 
ITRs, there exists a distribution of ( X , A, R) on X x Ax M. such that 

K(d n ) - 77* > 2 Mc n . 

This is a generic result, and applies to any algorithm for ITRs. It implies that rate of 
convergence studies for particular rules must necessarily be accompanied by conditions on 
(X, A, R). Before moving back to RWL, we introduce the following quantity: 

= i n f 7j\\h\\K+ 'R-T, g (h + b)—K^g- 

heH K ,be R 2 ,y 

The term .4(A) describes how well the infinite-sample regularized T-risk ^\\h\\ 2 K +TZ T ,g(h+b) 
approximates the optimal T-risk TVf . A similar quantity is called the approximation error 
function in the literature of learning theory (Steinwart and Christmann 2008, Definition 
5.14). Since .4(A) is an infimum of affine linear functions, .-4(A) is continuous. Thus 

lim .4(A) = inf 77t q (h + b) — 774 a . 

A-H) h£n K ,be R ,y ’ 9 

By Lemma 3.4, for a universal kernel K, lim^o -4(A) = 0. 

In order to establish the convergence rate, let us additionally assume that there exist 
constants c > 0 and fi € (0,1] such that 

.4(A) < c\ 0 , VA > 0. (16) 

The assumption is standard in the literature of learning theory (Steinwart and Christmann 
2008, p. 229). The following theorem is the main result in this section. This theorem 
is proved in Appendix using techniques of concentration inequalities developed in Bartlett 
and Mendelson (2002). 

Theorem 3.7. For any distribution P for ( X,A,R ) that satisfies the approximation error 

assumption (16), take X n = n 2 P+ 1 . For a measurable function g, assume that < 

M 0 almost surely. Then with probability at least 1 — 5, 

lZ(sign(f Dnt x n )) -TV < ^(4/%"^, 

where the constant c is independent of 5 and n, and c decreases as Mq decreases. 

The resulting rate depends on the polynomial decay rate ft of the approximation error, 
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and the optimal rate is about n -1 / 3 when /3 approaches 1. Theorem 3.7 also implies that 
even when the sample size n is fixed, an appropriately chosen g with a small bound Mq 
would improve performance. It is easy to verify that g* in (5) is the function g that 
minimizes E ^ ^^(A^x) ) • So w l ien we choose g* as in RWL, we may possibly obtain a 
smaller Mo, as shown in Figure 1. 

We are particularly interested in the Gaussian RBF kernel since it is widely used and is 
universal. It is important to understand the approximation error assumption (16) for the 
Gaussian RBF kernel. We first introduce the following “geometric noise” assumption for 
P on ( X,A,R ) (Steinwart and Scovel 2007). Let 


S(x) = E(R\X = x, A = 1) - E(R|X =x,A = -1). 


Define X + = {x E X : 5(x) > 0}, X = {x € X : 5{x) < 0}, and X° = {x € X : 5{x) = 0}. 
Now we define a distance function * >— > t x by 


7~x 


'd(x,X° UX+) if 
< d(x, X° U X~) if 
0 if 


x £ X~, 
x £ X + , 
x e x°, 


where d(x,A) denotes the distance of a: to a set A with respect to the Euclidean norm. 
Now we present the geometric noise condition for distributions. 

Definition 3.8. Let X C M p be compact and P be a probability measure of ( X,A,R ) on 
X x A x A4. We say that P has geometric noise exponent q > 0 if there exists a constant 
C > 0 such that 

\5(x)\n(dx) < Ct qp / 2 , t > 0, (17) 

where g is the marginal measure on X. 

The integral condition (17) describes the concentration of the measure \5{x)\dg near 
the decision boundary in the sense that the less the measure is concentrated in this region 
the larger the geometric noise exponent can be chosen. The following lemma shows that 
the geometric noise condition can be used to guarantee the approximation error assumption 
(16) when the parameter a of the Gaussian RBF kernel K a is appropriately chosen. 



Lemma 3.9. Let X be the closed unit ball of the Euclidean space R p , and P be a distribution 
on X x A x A4 that has geometric noise exponent 0 < q < oo with constant C in (17). Let 

i 

a = A 17+T)p. Then for the Gaussian RBF kernel K a , there is a constant c > 0 depending 
only on the dimension p, the geometric noise exponent q and constant C, such that for all 
A > 0 we have 

.4(A) < c\ q /( q+1 \ 

The proof provided in Appendix follows the idea from Theorem 2.7 in Steinwart and 
Scovel (2007) by replacing their 2g(x) — 1 with 5{x). The key point is the following in¬ 
equality, for all measurable /, 


T^T,g(f) — T^r,g — 2E(|5(oj)| • \f — d*\), 
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which is the counterpart of Equation (24) in Steinwart and Scovel (2007). 

Now we are ready to present the convergence rate of RWL with Gaussian RBF kernel 
by combining Theorem 3.7 and Lemma 3.9: 

Proposition 3.10. Let X be the closed unit ball of the Euclidean space M p , and P be a 
distribution on X x Ax M. that has geometric noise exponent 0 < q < oo with constant C 
in (17). For a measurable function g, assume that almost surely. Consider 

a Gaussian RBF kernel K an . Take A n = n 3 9 +1 and a n = X n q p . Then with probability 
at least 1 — 5, 

7l(sign(f Dri}Xn )) - TZ* < c v / log(4/5)n" 3 TF I , 
where the constant c is independent of 5 and n, and c decreases as Mq decreases. 

The optimal rate for the risk is about n -1 / 3 when the geometric noise exponent q is 
sufficiently large. A better rate may be achieved by using techniques in Steinwart and 
Scovel (2007), but it is beyond the scope of this work. 


4 Variable selection for RWL 

Variable selection is an important area in modern statistical research. Gunter et al. (2011) 
distinguished prescriptive covariates from predictive covariates. The latter refers to covari¬ 
ates which are related to the prediction of outcomes, and the former refers to covariates 
which help to prescribe optimal ITRs. Clearly, prescriptive covariates are predictive too. 
In practice, a large number of clinical covariates are often available for estimating optimal 
ITRs. However, many of them might not be prescriptive. Thus careful variable selection 
could lead to better performance. 


4.1 Variable selection for linear RWL 

There is a vast body of literature on variable selection for linear classification in statistical 
learning. For instance, Zhu et al. (2003) and Fung and Mangasarian (2004) investigated 
the support vector machine (SVM) using the Li-norm penalty (Tibshirani 1994). Wang 
et al. (2008) applied the elastic-net penalty (Zou and Hastie 2005) for variable selection 
with an SVM-like method. In this article, we replace the i^-norm penalty in RWL with the 
elastic-net penalty, 

-5ilMli + — w w - 

where ||ru||i = Y^j= i \ w j\ is the G-norm. As a hybrid of the G-norm and ^-norm penal¬ 
ties, the elastic-net penalty retains the variable selection features of the l \-norm penalty, 
and tends to provide similar estimated coefficients for highly correlated variables, i.e. the 
grouping effect, as the .^-norm penalty does. Hence, highly correlated variables are selected 
or removed together. 

The elastic-net penalized linear RWL aims to minimize 


\ II ii 1-^2 t i 1 

Al| Mil + TT W W + - > , —7-7 

2 n f-)TT(ai,Xi) 


T(ai(w T Xi + 6 )), 
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where Ai(> 0) and A 2 (> 0) are regularization parameters. We still apply the d.c. algorithm 
to solve the above optimization problem. Following a similar decomposition, the convex 
subproblem at the (t + l)’th iteration of the d.c. algorithm is as follows, 


min 

w,b 


\ 11 || . A2 t 

Ai \\w 1 H- w w 

111 in 2 


1 n 

+ - y' Ui (ui)i{fi > 0) +$0 (ui)i(fi < 0) 
n L 


2—1 


- , k(Q j i-) Xi ) 




(t) 


Ui 


2—1 


(18) 

where Ui = a,i(w T Xi + b ), and /3^ is computed by (12). The objective function is a sum 
of a smooth function and a non-smooth £i-norm penalty. Since L-BFGS can only deal 
with smooth objective functions, it may not work here. Instead we use projected scaled 
sub-gradient (PSS) algorithms (Schmidt 2010), which are extensions of L-BFGS to the 
case of optimizing a smooth function with an £i-norm penalty. Several PSS algorithms 
are proposed in Schmidt (2010). Among them, the Gafni-Bertseka variant is particularly 
effective for our purpose. After solving the subproblem, we update /3^ t+1 ^ by (12). The 
procedure is repeated until (3 converges. The obtained decision function is f(x) = w T x-\-b , 
and thus the estimated optimal ITR is the sign of f(x). 


4.2 Variable selection for RWL with nonlinear kernels 


Many researchers have noticed that nonlinear kernel methods may perform poorly when 
there are irrelevant covariates presented in the data (Weston et al. 2000; Grandvalet and 
Canu 2002; Lin and Zhang 2006; Lafferty and Wasserman 2008). For optimal treatment 
regimes, we may suffer a similar problem when using a nonlinear kernel. For the Gaussian 
RBF kernel, we assume that the geometric noise condition in Definition 3.8 is satisfied with 
exponent q. When several additional non-descriptive covariates are included in the data, 
the geometric noise exponent is decreased according to (17), and hence so is the convergence 
rate in Proposition 3.10. The presence of non-descriptive covariates thus can deteriorate 
the performance of RWL. 

The variable selection approach proposed in this section is inspired by the KNIFE 
algorithm (Allen 2013), where a set of scaling factors on the covariates are employed to form 
a regularized loss function. The idea of scaling covariates within the kernel has appeared 
in several methods from the machine learning community (Weston et al. 2000; Grandvalet 
and Canu 2002; Rakotomamonjy 2003; Argyriou et al. 2006). 

We take the Gaussian RBF kernel as an example. Define the covariates-scaled Gaussian 
RBF kernel, 


K r) (x, z ) = exp 


J2rij( x j ~ Zjf , 

i=i 


where r] = (771 , • • • ,t? p ) t > 0. Here each covariate Xj is scaled by y/fjj- Setting rjj = 0 is 
equivalent to discarding the j’th covariate. The hyperparameter a in the original Gaussian 
RBF kernel is absorbed to the scaling factors. Similar with KNIFE, we seek ( v,b,fj ) to 
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minimize the following optimization problem: 


min 

v,b,rj 


subject to 


A ilMli + v i v j K rl {Xi,X j ) 


»,j=i 


H ^ ' 7 “ r T(di ( ^ ] VjK^ixi, Xj ) + b 

n n(ai,Xi) V v 


j=i 


»7 > 0 , 


( 19 ) 


where Ai(> 0) and A 2 (> 0) are regularization parameters. Compared with previous non¬ 
linear RWL optimization problem in (15), (19) has an id-norm penalty on scaling factors. 
The objective function is singular at r\ = 0 due to nonnegativity of scaling factors. So (19) 
may produce zero solutions for some of the r), and hence performs variable selection. 

We adopt a similar decomposition trick that has been used several times before. The 
subproblem at the (t + l)’th iteration is 


min 

v ,b,ri 


subject to 


Ai 



^ ) Vi Vj b\ q ( X ,, X j ) 

i,j = 1 



V > 0, 


(ui)I(fj > 0) + 4> 0 (ui)I(ri < 0) 


- ^"(^2? ^h) 


+ E^ t)u *>(2°) 

i= 1 


where Ui = a*( Yl'j= l v j^r){ x i, x j) + 6), and (3^ is computed by (12). Recall that the d.c. 
algorithm is a special case of the Majorize-Minimization (MM) algorithm. Although with 
the covariate-scaled kernel the subproblem is nonconvex, we still apply a similar iterative 
procedure as in the d.c. algorithm, i.e. solving a sequence of subproblems (20), but based 
on the MM algorithm. The subproblem (20) is a smooth optimization problem with box 
constraints. In this article, we use L-BFGS-B (Byrd et al. 1995; Morales and Nocedal 
2011), which extends L-BFGS to handle simple box constraints on variables. After solving 
the subproblem (20), we update itj and (3. The procedure is repeated until convergence. 
Then the obtained decision function is f(x) = Li x) + b, and thus the estimated 

optimal ITR is the sign of /(*). The covariates with nonzero scaling factors t) are identified 
to be important in estimating ITRs. 

This variable selection technique can be applied to other nonlinear kernels, such as the 
polynomial kernel, and even to the linear kernel. However for the linear kernel, we recom¬ 
mend the approach in the prior section, where the subproblem (18) is a convex optimization 
problem with p + 1 variables. Note that the subproblem (20) in this section is nonconvex 
even for the linear kernel, and there are n + p + 1 variables for the optimization problem. 
This is much more challenging. 


5 Simulation studies 

We carried out extensive simulation studies to investigate finite sample performance of the 
proposed RWL methods. 

We first evaluated the performance of RWL methods with low-dimensional covariates. 
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In the simulations, we generated 5-dimensional vectors of clinical covariates x\, ■ ■ ■ , X 5 , con¬ 
sisting of independent uniform random variables U (—1,1). The treatment A was generated 
from {—1,1} independently of X with P(A = 1) = 0.5. That is, 7 r(a,*) = 0.5 for all a 
and x. The response R was normally distributed with mean Q o = ^o(x) + do(x) ■ a and 
standard deviation 1 , where Ho( x ) is the common effect for clinical covariates, and <5o(cc) ■a 
is the interaction between treatment and clinical covariates. We considered five scenarios 
with different choices of fio( x ) and 5q(x ): 


(0) 

tio( x ) 

= 1 + 

Xi 

+ 

X2 

+ 

2x 3 

+ 

0.5x4; 

S 0 ( x ) 

= 0.4(x2 - 

0.25xf - 1). 

(1) 

Vo( x ) 

= 1 + 

Xl 

+ 

x 2 

+ 

2x 3 

+ 

0 . 5 x 4 ; 

S 0 ( x ) 

= 1.8(0.3 - 

- Xl - x 2 ) ■ 

(2) 

Ho{x) 

= 1 + 

Xl 

+ 

X2 

+ 

2x 3 

+ 

0.5x4; 

S 0 ( x ) 

= 1.3(x 2 - 

2xf + 0.3). 

(3) 

Ho( x ) 

= 1 + 

™2 

X 1 

+ 

r 2 

x 2 

+ 

2x§ 

+ 

0.5x1; 

S 0 ( x ) 

= 3.8(0.8 - 

_ /y»2 _ ^*2 \ 

X 1 x 2)’ 

(4) 

M x ) 

= 1 + 

xl 

+ 

x\ 

+ 

2x 3 

1 

6 0 ( x ) 

= 10(1 - 

- *1 - *1)0 

x i + x 2 — 0-2) 


Scenario 0 is similar with the second scenario in Zhao et al. (2012). When X 2 is restricted 
to [— 1 , 1 ], the treatment arm —1 is always better than the treatment arm 1 on average. 
Thus the true optimal ITR in Scenario 0 is to assign all subjects to arm —1. The decision 
boundaries for the remaining four scenarios are determined by x± and x 2 only. The scenarios 
have different decision boundaries in truth. The decision boundary is a line in Scenario 1, a 
parabola in Scenario 2, a circle in Scenario 3, and a ring in Scenario 4. Their true optimal 
ITRs are illustrated in Figure 3. It is unclear how often a circle or ring boundary structure 
will occur in practice, although general nonlinear boundaries are likely to arise frequently. 
The purpose of including these in the simulations is to verify that the proposed approach 
can handle even the most difficult boundary structures. The coefficients in Scenarios 0 ~ 3 
were chosen to reflect a medium effect size according to Cohen’s d index (Cohen 1988); and 
the coefficients in Scenario 4 reflect a small effect size. The Cohen’s d index is defined as 
the standardized difference in mean responses between two treatment arms, that is, 

\E(R\A = 1)-ELR|A = -l)| 
cs —— — ■ 

^/[Var(R\A = 1) + Var(R\A = -l)]/2' 

We compared the performance of the following five methods: 

(1) The proposed RWL using the Gaussian RBF kernel (RWL-Gaussian). Residuals were 
estimated by a linear main effects model. 

(2) The proposed RWL using the linear kernel (RWL-Linear). Residuals were estimated 
by a linear main effects model. 

(3) OWL proposed in Zhao et al. (2012) using the Gaussian RBF kernel (OWL-Gaussian). 

(4) OWL using the linear kernel (OWL-Linear). 

(5) Q-PLS proposed by Qian and Murphy (2011). 

The OWL methods were reviewed in the method section. Since the OWL methods can 
only handle nonnegative outcomes, we subtracted min{rj} from all outcome responses. 
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Figure 3: True optimal ITRs in simulation studies. For subjects in the shade area, the best 
treatment is 1; for subjects in the white area, the best treatment is —1. 
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This is essentially the approach used in Zhao et al. (2012) [personal communication]. 
In the simulation studies, ^i-PLS approximated E(R|X,A) using the basis function set 
(1, X, A, XA), and applied LASSO (Tibshirani 1994) for variable selection. The optimal 
ITR was determined by the sign of the difference between the estimated E(R|W,A = 1) 
and E(R\X,A = -1). 

For each scenario, we considered two sample sizes for training datasets: n = 100 and 
n = 400, and repeated the simulation 500 times. All five methods had at least one tun¬ 
ing parameter. For example, there was one tuning parameter, A, in RWL-linear; and 
two tuning parameters, A and a in RWL-Gaussian. In the simulations, we applied a 10- 
fold cross-validation procedure to tune parameters. For RWL-Linear, we searched over a 
pre-specified finite set of A’s to select the best one maximizing the average of estimated 
values from validation data; while for RWL-Gaussian, we searched over a pre-specified fi¬ 
nite set of (A,<t)’s to select the best pair. The selection of tuning parameters in £i-PLS 
and OWL followed similarly. In the comparisons, the performances of five methods were 
evaluated by two criteria: the first criterion is the value function under the estimated op¬ 
timal ITR when we applied to an independent and large test dataset; the second criterion 
is the misclassification error rate of the estimated optimal ITR from the true optimal ITR 
on the large test dataset. Specifically, a test set with 10,000 observations was simulated 
to assess the performance. The estimated value function under any ITR d is given by 
P*[M(A = d(X))/n(A, X)]/F* n [I(A = d(X))/n(A,X)] (Murphy 2005), where P£ denotes 
the empirical average using the test data; the misclassification rate under any ITR d is 
given by P*[I(d(X) 7 ^ d*(W))], where d* is the true optimal ITR which is known when 
generating the simulated data. 

The simulation results are presented in Table 1. We report the mean and standard 
deviation of value functions and misclassification rates over 500 replications. In Scenario 0, 
the true optimal ITR would assign all subjects to treatment arm —1. When the sample size 
was small (n = 100), £i-PLS, OWL-Linear and RWL-Linear showed similar performance, 
while OWL-Gaussian and RWL-Gaussian were slightly worse. Note that when X 2 can go 
beyond 1, the heterogeneous treatment effect does exist with non-linear (parabola) decision 
boundary. The deteriorated performance of OWL-Gaussian and RWL-Gaussian may be 
due to unexpected extrapolation. When the sample size was large (n = 400), all methods 
performed similarly, and assigned almost all subjects to arm —1. Scenario 1 was a linear 
example. The model specification in £i-PLS was correct, and this method performed very 
well. The proposed RWL methods, with either the linear kernel or the Gaussian RBF 
kernel, yielded similar performance with I'i-PLS, and were much better than OWL methods. 
Another advantage of using RWL versus OWL was also reflected by a smaller variance in 
value functions. The idea of using residuals instead of the original outcomes is able to 
stabilize the variance, as we discussed in the method section. 

The conditional mean outcomes and decision boundaries in the remaining three scenarios 
were nonlinear. Id-PLS, RWL-Linear, and OWL-Linear were misspecified, and hence they 
did not perform very well in these three scenarios. We focus on the comparison between 
RWL-Gaussian and OWL-Gaussian. These two methods equipped with the Gaussian RBF 
kernel should have the ability to capture the nonlinear structure of decision boundaries. 
In Scenarios 2 and 3, RWL-Gaussian outperformed OWL-Gaussian in terms of achieving 
higher values and smaller variances. It was challenging to find the optimal ITR in Scenario 
4 since the decision boundary was complicated and the effective size was small. When 
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the sample size was small (n = 100), the performance of RWL-Gaussian was only slightly 
better than that of OWL-Gaussian, but with larger variance. It might be too hard to 
learn the complicated decision boundary from the limited sample. However, when the 
sample size increased, RWL-Gaussian showed significantly better performance than other 
methods. The excellent performance of RWL-Gaussian confirms its power in finding the 
optimal ITR. Although using the Gaussian RBF kernel, the performance of OWL-Gaussian 
was not comparable with that of RWL-Gaussian. In Scenario 2, OWL-Gaussian performed 
even worse than £i-PLS and RWL-Linear, both of which can only detect linear boundaries. 
While in Scenario 4, OWL-Gaussian gave similar performance as OWL-Linear. Zhao et al. 
(2012) also reported similar finding in their simulation studies that there are no large 
differences in the performance between OWL-Linear and OWL-Gaussian. 

As we demonstrated in Section 2.2, OWL methods tend to favor the treatment assign¬ 
ments that subjects actually received, and this behavior deteriorates finite sample perfor¬ 
mance. We calculated treatment matching factors of these five methods on the training 
data. They are shown in Table 2. We expect the treatment matching factor to be close 
to 1. However, the treatment matching factors for OWL methods were greater than 1. 
OWL-Gaussian achieved the largest treatment matching factors, especially when the sam¬ 
ple size was small. This may partially explain why OWL-Gaussian may not outperform 
OWL-Linear even when the decision boundary is nonlinear. Whereas for RWL methods, 
the treatment matching factors were close to 1. By appropriately controlling the treatment 
matching factors, the aim of RWL methods is to maximize the estimate in (8), and hence 
to improve finite sample performance. 

We applied a linear main effects model to estimate residuals. The model was correctly 
specified for Scenarios 0, 1 and 2, and misspecified for Scenarios 3 and 4. However, the 
superior performance of RWL methods in Scenarios 3 and 4 demonstrates the robustness of 
our methods to residual estimates. We also considered a null model, which was misspecified 
for all scenarios, to estimate residuals. The results were only slightly worse than those shown 
in Table 1 for Scenarios 1 and 2 when n = 100, and other scenarios were similar especially 
when the sample size increased [results not shown]. 

We then evaluated the performance of RWL methods with moderate-dimensional clinical 
covariates. We adopted the same data generating procedure as in the above five scenarios 
except that the dimension of clinical covariates was increased from 5 to 50 by adding 45 in¬ 
dependent uniform random variables U{— 1,1). Thus among all these 50 clinical covariates, 
only x\ and X 2 are attributed to the decision boundaries for Scenarios 1 ~ 4. We repeated 
the simulations, and the results are shown in Table 3. In Scenario 0, all methods performed 
similar, and assigned almost all subjects to the treatment arm —1 when the sample size 
was large (n = 400). In Scenario 1, £i-PLS performed very well because of correct model 
specification and inside variable selection techniques. Although RWL methods showed bet¬ 
ter performance than OWL methods, they were not comparable to £i-PLS. The decision 
boundaries in Scenarios 2^4 are nonlinear. Both RWL and OWL failed to detect the 
decision boundaries in these scenarios. They produced similar performance with either the 
linear kernel or Gaussian RBF kernel. RWL and OWL methods with Gaussian RBF kernel 
may perform particularly poorly when many non-descriptive covariates are present. 

We tested the performance of variable selection approaches described in Section 4 for 
the linear kernel and Gaussian RBF kernel. They are called RWL-VS-Linear and RWL-VS- 
Gaussian, respectively, in this article. There were two tuning parameters, Ai and A 2 , for 
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Table 1: Mean (std) of empirical value functions and misclassification rates evaluated on 
independent test data for 5 simulation scenarios with 5 covariates. The best value function 
and minimal misclassification rate for each scenario and sample size combination are in 
bold. 



n 

= 100 

n 

= 400 


Value 

Misclassification 

Value 

Misclassification 

Scenario 0 (Optimal value 1.43) 

£i-PLS 

1.41 ( 0 . 07 ) 

0.05 ( 0 . 10 ) 

1.42 (0.01) 

0.02 (0.04) 

OWL-Linear 

1.39 (0.11) 

0.07 (0.14) 

1.43 ( 0 . 02 ) 

0.01 ( 0 . 04 ) 

OWL-Gaussian 

1.33 (0.15) 

0.13 (0.18) 

1.41 (0.05) 

0.03 (0.08) 

RWL-Linear 

1.40 (0.05) 

0.06 (0.09) 

1.42 (0.02) 

0.03 (0.05) 

RWL- Gaussian 

1.34 (0.10) 

0.13 (0.13) 

1.40 (0.04) 

0.07 (0.07) 

Scenario 1 (Optimal value 2.25) 

£i-PLS 

2.22 ( 0 . 03 ) 

0.06 ( 0 . 04 ) 

2.24 ( 0 . 02 ) 

0.03 ( 0 . 03 ) 

OWL-Linear 

1.91 (0.22) 

0.23 (0.09) 

2.08 (0.14) 

0.16 (0.06) 

OWL-Gaussian 

1.88 (0.24) 

0.24 (0.09) 

2.08 (0.12) 

0.16 (0.06) 

RWL-Linear 

2.19 (0.04) 

0.09 (0.04) 

2.23 (0.01) 

0.05 (0.02) 

RWL- Gaussian 

2.17 (0.06) 

0.11 (0.04) 

2.22 (0.02) 

0.05 (0.02) 

Scenario 2 (Optimal value 1.96) 

Ll-PLS 

1.71 (0.07) 

0.24 (0.04) 

1.75 (0.01) 

0.22 (0.01) 

OWL-Linear 

1.51 (0.12) 

0.32 (0.05) 

1.59 (0.10) 

0.29 (0.06) 

OWL-Gaussian 

1.49 (0.15) 

0.33 (0.06) 

1.63 (0.11) 

0.26 (0.05) 

RWL-Linear 

1.66 (0.08) 

0.26 (0.04) 

1.74 (0.03) 

0.23 (0.02) 

RWL-Gaussian 

1.75 ( 0 . 09 ) 

0.20 ( 0 . 05 ) 

1.90 ( 0 . 03 ) 

0.10 ( 0 . 03 ) 

Scenario 3 (Optimal value 3.88) 

Ll-PLS 

3.00 (0.11) 

0.38 (0.03) 

3.03 (0.04) 

0.37 (0.01) 

OWL-Linear 

2.98 (0.10) 

0.38 (0.03) 

3.01 (0.02) 

0.38 (0.01) 

OWL-Gaussian 

3.21 (0.18) 

0.32 (0.05) 

3.56 (0.12) 

0.22 (0.04) 

RWL-Linear 

3.15 (0.13) 

0.34 (0.03) 

3.28 (0.04) 

0.31 (0.01) 

RWL- Gaussian 

3.62 ( 0 . 12 ) 

0.19 ( 0 . 04 ) 

3.82 ( 0 . 04 ) 

0.10 ( 0 . 02 ) 

Scenario 4 (Optimal value 3.87) 

Ll-PLS 

2.29 (0.14) 

0.49 (0.08) 

2.38 (0.15) 

0.55 (0.08) 

OWL-Linear 

2.42 (0.14) 

0.55 (0.08) 

2.49 (0.11) 

0.59 (0.06) 

OWL-Gaussian 

2.43 (0.15) 

0.52 (0.07) 

2.59 (0.15) 

0.50 (0.07) 

RWL-Linear 

2.42 (0.14) 

0.54 (0.09) 

2.49 (0.10) 

0.58 (0.08) 

RWL- Gaussian 

2.68 ( 0 . 28 ) 

0.43 ( 0 . 08 ) 

3.49 ( 0 . 08 ) 

0.22 ( 0 . 03 ) 
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Table 2: Mean (std) of treatment matching factors evaluated on the training data for 5 
simulation scenarios with 5 covariates. 



n = 100 

n = 400 

Scenario 0 

4-PLS 

1.00 (0.04) 

1.00 (0.01) 

OWL-Linear 

1.03 (0.07) 

1.00 (0.01) 

OWL-Gaussian 

1.14 (0.24) 

1.02 (0.06) 

RWL-Linear 

1.00 (0.04) 

1.00 (0.01) 

RWL-Gaussian 

1.00 (0.06) 

1.00 (0.03) 

Scenario 1 

4-PLS 

0.99 (0.10) 

0.99 (0.05) 

OWL-Linear 

1.10 (0.08) 

1.04 (0.04) 

OWL-Gaussian 

1.15 (0.13) 

1.05 (0.05) 

RWL-Linear 

0.99 (0.09) 

0.99 (0.04) 

RWL-Gaussian 

0.99 (0.08) 

0.99 (0.04) 

Scenario 2 

4-PLS 

1.00 (0.09) 

1.00 (0.04) 

OWL-Linear 

1.06 (0.09) 

1.03 (0.04) 

OWL-Gaussian 

1.18 (0.20) 

1.10 (0.07) 

RWL-Linear 

1.00 (0.08) 

1.00 (0.04) 

RWL-Gaussian 

1.01 (0.07) 

1.00 (0.04) 

Scenario 3 

4-PLS 

1.00 (0.05) 

1.00 (0.01) 

OWL-Linear 

1.02 (0.06) 

1.00 (0.02) 

OWL-Gaussian 

1.32 (0.17) 

1.14 (0.05) 

RWL-Linear 

1.01 (0.06) 

1.01 (0.04) 

RWL-Gaussian 

1.04 (0.06) 

1.02 (0.04) 

Scenario 4 

4-PLS 

1.00 (0.08) 

1.00 (0.03) 

OWL-Linear 

1.07 (0.10) 

1.02 (0.04) 

OWL-Gaussian 

1.37 (0.34) 

1.27 (0.24) 

RWL-Linear 

1.00 (0.06) 

1.00 (0.03) 

RWL-Gaussian 

1.00 (0.08) 

1.02 (0.04) 
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both approaches. We applied a 10-fold cross-validation procedure to tune parameters from 
a pre-specihed finite set of (Ai, A 2 )’s. The simulation results are also shown in Table 3. In 
Scenario 0, both RWL-VS methods showed similar performance as RWL methods. In Sce¬ 
nario 1, both RWL-VS-Linear and RWL-VS-Gaussian improved the performance of RWL 
methods. When the sample size was 400, the performance was as good as that of IA-PLS. 
For Scenarios 2 ~ 4, we only present results from RWL-VS-Gaussian. When the sample 
size was small (re = 100), the performance was slightly better (in Scenarios 2 and 3) than, 
or similar (in scenario 4) as that of RWL-Gaussian. It is well known that large sample 
sizes are needed to find interactions in models. So for similar reasons, we may expect large 
samples to be necessary to find and confirm the existence of heterogeneity of treatment ef¬ 
fects, and accurately define the decision boundary. In the simulations, when the sample size 
increased, the performance of RWL-VS-Gaussian improved significantly. Variable selection 
is necessary for RWL when there are many non-descriptive clinical covariates. 

6 Data analysis 

We applied the proposed methods to analyze data from the EPIC randomized clinical trial 
(Treggiari et al. 2011). The trial was designed to determine the optimal anti-pseudomonal 
treatment strategy in children with cystic fibrosis (CF) who recently acquired Pseudomonas 
aeruginosa {Pa). Newly identified Pa was defined as the first lifetime documented Pa posi¬ 
tive culture within 6 months of baseline or a Pa positive culture within 6 months of baseline 
after a two-year absence of Pa culture-positivity. Other eligibility criteria have been previ¬ 
ously reported (Treggiari et al. 2009). A total of 304 patients ages 1~12 were randomized 
in a 1:1 ratio to one of two maintenance treatment strategies: cycled therapy (treatment 
with anti-pseudomonal therapy provided in quarterly cycles regardless of Pa positivity) 
and culture-based therapy (treatment only in response to identification of Pa positive cul¬ 
tures from quarterly cultures). All patients regardless of randomization strategy received 
an initial course of anti-pseudomonal therapy in response to their Pa positive culture at 
eligibility and proceeded with treatment according to their randomization strategy. In this 
article, we considered two endpoints over the course of the 18-month study. One is the 
number of Pa positive cultures from scheduled follow-up quarterly cultures, and the other 
is the number of pulmonary exacerbations (PE) requiring any (intravenous, inhaled, or oral) 
antibiotic use or hospitalization during the study. In the original clinical trial, there was no 
significant difference between the two maintenance treatment strategies at the population 
level for either the Pa outcome (p-value 0.222) or PE outcome (p-value 0.280). 

We considered 10 baseline clinical covariates, including age, gender, F508del genotype, 
weight, height, BMI, first documented lifetime Pa positive culture at eligibility versus 
positive after a two-year history of negative cultures (neverpapos), use of antibiotics within 
6 months prior to baseline (anyabxhist), Pa positive at the baseline visit versus Pa positive 
within 6 months prior to baseline only (pabl), and positive versus negative baseline S. 
aureus culture status (sabl). There were three levels (homozygous, heterozygous and other) 
for F508del genotype. We coded genotype as two dichotomous covariates, homozygous and 
heterozygous, each referenced in relation to the “othe” genotype category. Four patients did 
not give consent to have their data put into the databank. We also excluded 17 patients with 
missing covariate values from the analyses. The data used here consisted of 283 patients: 
141 in the cycled therapy, and 142 in the culture-based therapy. Each clinical covariate was 
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Table 3: Mean (std) of empirical value functions and misclassification rates evaluated on 
independent test data for 5 simulation scenarios with 50 covariates. The best value function 
and minimal misclassification rate for each scenario and sample size combination are in bold. 



n 

= 100 

n 

= 400 


Value 

Misclassification 

Value 

Misclassification 

Scenario 0 (Optimal value 1.43) 

£i-PLS 

1.34 (0.15) 

0.11 (0.17) 

1.42 (0.03) 

0.03 (0.06) 

OWL-Linear 

1.36 (0.15) 

0.08 (0.17) 

1.43 ( 0 . 03 ) 

0.01 ( 0 . 03 ) 

OWL-Gaussian 

1.35 (0.14) 

0.10 (0.16) 

1.42 (0.04) 

0.01 ( 0 . 05 ) 

RWL-Linear 

1.38 ( 0 . 11 ) 

0.07 ( 0 . 13 ) 

1.42 (0.04) 

0.03 (0.06) 

RWL-Gaussian 

1.34 (0.13) 

0.11 (0.15) 

1.40 (0.05) 

0.04 (0.07) 

RWL-VS-Linear 

1.34 (0.11) 

0.11 (0.14) 

1.40 (0.04) 

0.05 (0.06) 

RWL-VS-Gaussian 

1.33 (0.12) 

0.13 (0.14) 

1.41 (0.03) 

0.05 (0.06) 

Scenario 1 (Optimal value 2.25) 

4-PLS 

2.17 ( 0 . 11 ) 

0.09 ( 0 . 06 ) 

2.23 ( 0 . 02 ) 

0.04 ( 0 . 03 ) 

OWL-Linear 

1.51 (0.12) 

0.37 (0.03) 

1.68 (0.10) 

0.32 (0.03) 

OWL-Gaussian 

1.49 (0.13) 

0.38 (0.04) 

1.65 (0.11) 

0.32 (0.04) 

RWL-Linear 

1.75 (0.14) 

0.29 (0.05) 

2.14 (0.03) 

0.13 (0.02) 

RWL-Gaussian 

1.77 (0.12) 

0.29 (0.04) 

2.14 (0.03) 

0.13 (0.02) 

RWL-VS-Linear 

2.05 (0.14) 

0.17 (0.07) 

2.22 (0.02) 

0.06 (0.03) 

RWL-V S- Gaussian 

2.02 (0.18) 

0.18 (0.08) 

2.21 (0.08) 

0.06 (0.04) 

Scenario 2 (Optimal value 1.96) 

O-PLS 

1.56 ( 0 . 19 ) 

0.30 ( 0 . 07 ) 

1.73 (0.03) 

0.23 (0.01) 

OWL-Linear 

1.42 (0.14) 

0.37 (0.04) 

1.47 (0.06) 

0.35 (0.01) 

OWL-Gaussian 

1.40 (0.14) 

0.38 (0.04) 

1.47 (0.07) 

0.35 (0.02) 

RWL-Linear 

1.42 (0.12) 

0.36 (0.04) 

1.60 (0.05) 

0.28 (0.02) 

RWL-Gaussian 

1.40 (0.14) 

0.36 (0.04) 

1.62 (0.05) 

0.28 (0.02) 

RWL-V S- Gaussian 

1.56 ( 0 . 15 ) 

0.30 ( 0 . 07 ) 

1.92 ( 0 . 06 ) 

0.09 ( 0 . 05 ) 

Scenario 3 (Optimal value 3.88) 

O-PLS 

2.90 (0.19) 

0.40 (0.05) 

3.00 (0.04) 

0.38 (0.01) 

OWL-Linear 

2.96 (0.14) 

0.39 (0.04) 

3.00 (0.02) 

0.38 (0.01) 

OWL-Gaussian 

2.97 (0.11) 

0.39 (0.03) 

3.00 (0.03) 

0.38 (0.01) 

RWL-Linear 

2.90 (0.18) 

0.40 (0.05) 

3.00 (0.05) 

0.38 (0.01) 

RWL-Gaussian 

2.88 (0.18) 

0.41 (0.05) 

2.96 (0.09) 

0.39 (0.02) 

RWL-V S- Gaussian 

3.17 ( 0 . 33 ) 

0.33 ( 0 . 09 ) 

3.82 ( 0 . 13 ) 

0.09 ( 0 . 05 ) 

Scenario 4 (Optimal value 3.87) 

£i-PLS 

2.32 (0.09) 

0.49 ( 0 . 05 ) 

2.36 (0.09) 

0.52 (0.05) 

OWL-Linear 

2.41 (0.14) 

0.55 (0.08) 

2.48 (0.11) 

0.59 (0.06) 

OWL-Gaussian 

2.43 ( 0 . 13 ) 

0.55 (0.07) 

2.47 (0.11) 

0.58 (0.06) 

RWL-Linear 

2.38 (0.13) 

0.53 (0.07) 

2.44 (0.12) 

0.56 (0.07) 

RWL-Gaussian 

2.37 (0.13) 

0.53 (0.07) 

2.44 (0.12) 

0.56 (0.07) 

RWL-V S- Gaussian 

2.37 (0.12) 

0.51 (0.05) 

3.47 ( 0 . 47 ) 

0.21 ( 0 . 15 ) 
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linearly scaled to the range [—1,+1], as recommended in Hsu et al. (2003). 

For the Pa-related endpoint, we used the ratio of the number of Pa negative cul¬ 
tures to the number of Pa cultures over the follow up period as the outcome. We were 
seeking an ITR to reduce the number of Pa positive cultures, so larger outcomes were 
preferred. We compared seven methods: ^i-PLS, OWL-Linear, OWL-Gaussian, RWL- 
Linear, RWL-Gaussian, RWL-VS-Linear, and RWL-VS-Gaussian. Tl-PLS used the basis 
function set (1 ,X,A,XA) in the regression model. We treated the outcome as contin¬ 
uous, and used a linear main effects model to estimate residuals for RWL methods. We 
used a 10-fold cross-validation procedure to tune parameters. The estimated rule was then 
used to predict the optimal treatment for each patient. The predicted treatment allocation 
is shown in Table 4. To evaluate the performance of estimated rules, we again carried 
out a 10-fold cross validation procedure. The data were randomly partitioned into 10 
roughly equal-sized parts. We estimated the ITR on nine parts of the data using the tuned 
parameters, and predicted optimal treatments on the part left out. We repeated the pro¬ 
cedure nine more times to predict the other parts, and obtained the predicted treatment 
for each patient. The first evaluation criterion was the value function, which is given by 
P n [RI(H = Pred)/ tt(A, X)\/P n [I(A = Pred)/ir(A, X)], where P n denotes the empirical 
average over a fold in cross validation and Pred is the predicted treatment in the cross 
validation procedure. The second evaluation criterion was related to significance tests. The 
test was performed when a 10-fold cross validation procedure is finished, that is, when we 
had the predicted treatments for all patients. By comparing the predicted treatments with 
the treatments that were actually assigned to patients, we divided patients into two groups: 
those who followed the estimated optimal rule and those who did not follow the rule. We 
tested the difference between the two groups using the two-sample t-test to see whether the 
group that followed the estimated rule was better than the other group at the significance 
level a = 0.05. The whole procedure was repeated 100 times with different fold partitions 
in the cross validation. We obtained 1000 value functions and 100 p-values. The mean and 
standard deviation of these value functions, the proportion of significant p-values and me¬ 
dian of these p-values are also presented in Table 4. As reference rules, we also considered 
two fixed treatment rules: assign all patients to (1) cycled therapy arm or (2) culture-based 
therapy arm. We show in Table 4 their value functions from repeated 10-fold cross vali¬ 
dation procedures and one-sided p-values from two-sample t-tests. Note that the p-values 
for fixed rules were not changed during the cross validation procedure. ^i-PLS showed 
the best performance, and identified that one covariate, baseline Pa status, is important 
in estimating ITRs. The significance tests also confirmed its superior performance. Both 
OWL methods failed to detect the heterogeneity of treatment effects. They assigned all 
patients to the cycled therapy arm. Although the performances of RWL methods (without 
variable selection) were only slightly better, on average, than those of OWL methods, they 
did detect some heterogeneity across treatments. Variable selection further improved the 
performance in terms of achieving higher values and better significance tests. Moreover, 
both RWL-VS-Linear and RWL-VS-Gaussian selected the same covariate as ^i-PLS. 

Our results are consistent with results from the trial suggesting that the subgroup 
of patients who were Pa negative at the baseline visit (albeit positive within 6 months 
of baseline to meet eligibility) may have more greatly benefited from cycled therapy to 
suppress Pa positivity during the trial. However, it is important to note clinically that 
further analyses of the trial data demonstrated the comparable effectiveness using culture- 
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Table 4: Comparison of methods for estimating ITRs on the EPIC data with the Pa 
endpoint. Higher values are better. 



Predicted treatment 
#cycled vs #culture 

Mean value 
(std) 

Proportion of 
significant p-values 

Median 
p-value 

£i-PLS 

168 

115 

0.913 (0.040) 

1 

0.004 

OWL-Linear 

283 

0 

0.894 (0.054) 

0 

0.111 

OWL-Gaussian 

283 

0 

0.894 (0.054) 

0 

0.111 

RWL-Linear 

178 

105 

0.896 (0.046) 

0.37 

0.072 

RWL-Gaussian 

169 

114 

0.900 (0.045) 

0.64 

0.0404 

RWL-VS-Linear 

168 

115 

0.909 (0.044) 

0.90 

0.002 

RWL-VS-Gaussian 

168 

115 

0.907 (0.045) 

0.85 

0.007 

Fixed rule (cycled) 

283 

0 

0.894 (0.054) 

0 

0.111 

Fixed rule (culture-based) 

0 

283 

0.864 (0.054) 

0 

0.889 


based therapy as compared to cycled therapy in reducing the overall prevalence of Pa 
positive cultures at the end of the trial. 

The number of PEs during the study was a rate variable. We were seeking an ITR 
to lower the number of PEs, so fewer PEs were desirable. We compared several methods: 
PoissonReg, C-Poisson Reg, OWL-Linear, OWL-Gaussian, RWL-Linear, RWL-Gaussian, 
RWL-VS-Linear, and RWL-VS-Gaussian. PoissonReg fitted a Poisson regression model 
using the basis function set (1 , X, A, XA), computed the predicted number of PEs for a 
particular patient at each arm, and assigned the treatment with smaller prediction to this 
patient. £i~PoissonReg was similar with PoissonReg, but used an i\ penalized term to 
perform variable selection. For OWL methods, we used the opposite of individual annual 
PE rate as the continuous outcome. The annual PE rate for the z-th patient is defined 
as ? * 365.25, where r* is the number of PEs for the z-th patient, and ti is the duration 
(in days) when the z-th patient stayed in the study. For RWL methods, we used a main 
effects Poisson regression model to estimate residuals, as described in Section 2.4. A 10-fold 
cross-validation was used to tune parameters. The predicted treatment allocation is shown 
in Table 5. A similar evaluation procedure as that for the Pa endpoint was also performed, 
where we used instead the annual PE rate for the group that followed the rule as the first 
evaluation criterion. The group-wise annual PE rate is given as *365.25, where 

R is the number of PEs, T is the duration (in days) in the study, P n denotes the empirical 
average over a fold in cross validation and Pred is the predicted treatment assignment in 
the cross validation procedure. For the second criterion, a Poisson regression model was 
used to test whether the group that followed the estimated rule had fewer PEs than the 
other group that did not follow the estimated rule. The results are presented in Table 5. 
We also considered two fixed rules as references. All RWL methods outperformed OWL and 
Poisson regression methods in terms of achieving better PE rates and better significance test 
results. ^i-PoissonReg identified three important covariates, baseline Pa status, baseline S. 
aureus staus and height, for treatment assignment. 

For this case, RWL-VS-Linear and RWL-VS-Gaussian did not improve performances 
comparing with their counterparts without variable selection. RWL-VS-Linear did not 
discard any clinical covariates. As its tuned parameter Ai was very small, RWL-VS-Linear 
gave a similar performance as RWL-Linear. The estimated rule from RWL-Linear was 
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Table 5: Comparison of methods for estimating ITR on the EPIC data with the PE end¬ 
point. Lower annual PE rates are better. 


Predicted treatment Mean annual Proportion of Median 
#cycled vs ^culture PE rate (std) significant p-values p-value 


PoissonReg 

150 

133 

0.742 (0.224) 

0.21 

0.245 

£i-PoissonReg 

120 

163 

0.707 (0.233) 

0.55 

0.042 

OWL-Linear 

141 

142 

0.722 (0.236) 

0.27 

0.123 

OWL-Gaussian 

140 

143 

0.714 (0.247) 

0.33 

0.079 

RWL-Linear 

105 

178 

0.673 (0.226) 

0.82 

0.013 

RWL-Gaussian 

108 

175 

0.667 (0.220) 

0.90 

0.008 

RWL-VS-Linear 

105 

178 

0.676 (0.227) 

0.75 

0.013 

RWL-VS-Gaussian 

108 

175 

0.688 (0.218) 

0.69 

0.024 

RWL-VS-Gaussian 






+ RWL-Gaussian 

102 

181 

0.639 (0.220) 

0.97 

0.001 

Fixed rule (cycled) 

283 

0 

0.725 (0.245) 

0 

0.140 

Fixed rule (culture-based) 

0 

283 

0.823 (0.253) 

0 

0.860 


determined by the following decision function F, 

F = 1.43 * I {gender = female ) + 0.07 * age — 0.58 * I (neverpapos = yes ) 

— 1.01 * I (pabl = +') — 0.17 * I (sabl =' + / ) + 0.02 * weight + 0.01 * height 
—0.17 * I(anyabxhist = yes ) + 0.11 * bmi — 0.90 * I (genotype = homozygous ) 
—0.83 * I (genotype = heterozygous ) — 2.90 

The corresponding ITR is that if F > 0, assign cycled therapy to this patient, otherwise as¬ 
sign the culture-based therapy. The proposed rule suggests that female gender and increas¬ 
ing age are key determinants for the potential of a patient to benefit from cycled therapy, 
which is not surprising since these characteristics are known risk factors for exacerbations. 
Counterintuitively, this rule suggests that Pa positivity at baseline leads against the rec¬ 
ommendation for cycled therapy unlike with the prior outcome of increased Pa frequency. 
The rule also suggests that patients with genotypes other than delF508 heterozygous and 
homozygous may benefit more from cycled therapy as compared to patient with the delF508 
mutation. Overall, these results indicate the importance of evaluating multiple endpoints 
to evaluate consistency in recommendations for treatment based on these proposed rules. 

For RWL-VS-Gaussian, three clinical covariates (age, weight, baseline S. aureus status) 
were identified to be unimportant for estimating ITRs. It is well known that the shrinkage 
by the t\ penalty causes the estimates of the non-zero coefficients to be biased towards zero 
(Hastie et al. 2001). For similar reasons, the estimates from RWL-VS-Gaussian might be 
biased. One approach for reducing this bias is to run the l\ penalty method to identify 
important covariates, and then fit a model without the l\ term to the selected set of 
covariates. We applied RWL-Gaussian to the identified set of covariates. The results are 
also presented in Table 5. The obtained annual PE rate was better than that from RWL- 
Gaussian alone. Thus variable selection is still helpful to improve the performance. 

In summary, the proposed RWL methods can identify potentially useful ITRs. For 
example, the estimated ITR for the Pa endpoint in the EPIC trial is that if the baseline 
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Pa status is positive, assign culture-based therapy to the patient; if the baseline Pa status 
is negative, assign the cycled therapy. As the Pa endpoint was a secondary endpoint in the 
trial, the clinical utility of these findings must be weighed in context of all study findings 
and safety. But nonetheless identifying such a strategy is important to improve personalized 
clinical practice if the strategy is confirmed by future comparative studies. 

7 Discussion 

In this article, we have proposed a general framework called Residual Weighted Learning 
(RWL) to use outcome residuals to estimate optimal ITRs. The residuals may be obtained 
from linear models or generalized linear models, and hence RWL can handle various types 
of outcomes, including continuous, count and binary outcomes. 

The proposed RWL methods appear to achieve better performance compared to other 
existing methods as shown in the simulation studies and real data analyses. The goal of 
RWL is to improve finite sample performance of OWL methods. It is not surprising that 
RWL outperforms OWL in terms of achieving higher value function and smaller variance. 
Two-step methods, such as £i~PLS and ^i-PoissonReg, are generally parametric. Their 
performance largely depends on how close the posited model is to the true model of the 
conditional mean outcome. When the model is correctly specified, two-step methods can be 
more efficient than nonparametric RWL methods. However, the relationship among clinical 
covariates, treatment assignment, and outcome is complex in practice. Misspecified models 
may lead to biased estimates. 

Both two-step methods and RWL require correct model specification to guarantee their 
performance. We consider two levels of model specification. The first one is the model for 
the decision boundary; the other is the model for the outcome. Two-step methods, such as 
£i-PLS, only specify models for conditional outcomes, and the decision boundary is implied 
from these models. Thus the performance of two-step methods are only related to these 
models. However, for RWL, the two levels are separated. RWL methods target the decision 
boundary directly. The model for decision boundary is more critical to performance. RWL 
with Gaussian RBF kernel can approximate any form of decision boundary when the sample 
size is sufficiently large. Even when using linear kernels, RWL is more robust to model 
misspecification on the decision boundary than the two-step methods. If the conditional 
mean outcome is assumed to be linear in the parameters, the decision boundary will be also; 
the converse does not hold. For RWL with linear kernel, the decision boundary remains a 
simple linear form, while the conditional mean outcome is allowed to be nonlinear. 

Another source of robustness of RWL comes from the model for the outcome, in which 
we estimate residuals. By Theorem 3.3, no matter how bad the residual estimate is, RWL 
is still consistent. In the simulation studies, we have investigated the robustness of residual 
estimates on finite samples. We applied the null model, as an example of incorrect model 
specification, to estimate residuals, and achieved satisfactory performance. The simulation 
studies confirm the robustness of RWL. In this article, we used simple models to obtain the 
residuals. However, RWL is quite flexible in the methods used to estimate residuals. When 
the sample size is small or the heterogeneous treatment effect is weak, we may consider a 
complicated method to accurately estimate residuals to improve the performance of RWL. 
Alternatively, nonparametric regression methods, such as support vector regression (Vapnik 
1998) or random forests (Breiman 2001), are an option to eliminate model misspecification. 
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RWL is closely related to a robust augmented inverse probability weighted estima¬ 
tor (AIPWE) of the value function. AIPWE was first introduced into optimal treatment 
regimes by Zhang et al. (2012). Under the setup in this article, i.e., A = {1, —1} and tt (a, x) 
is known, for any rule d , its associated value function can be estimated by an AIPWE, 

AipwE(d)Ay(= n 

n ~L l TT {ai,Xi) TT{ai,Xi) ) 

where m{x) = fii(x)I(d(x) = 1) + (i-i(x)I(d(x) = —1), fii(x) is an estimate of E(i?|X = 
x,A = 1), and /i_ \{x) is an estimate of E(R|W = x, A = —1). If we take g*(x ) = m(x), 
for a fixed d(x), then maximizing AIPWE is equivalent to minimizing the risk in (6). 
Hence AIPWE is essentially a “contrast” weighted learning method, where the contrast is 
a modelled form that links closely to the residual. Conversely, if we take gi(x) = /i_ \{x) = 
g*(x), where g*(x) estimates g*(x) in (5), then maximizing AIPWE is again equivalent to 
minimizing (6). However, RWL can now be obtained by replacing the 0-1 loss with the 
smoothed ramp loss, and hence RWL can be viewed as a modification of AIPWE with 
alternative useful connections to classification methods. 

Variable selection is critical for good performance of RWL, as demonstrated in the sim¬ 
ulation studies and data analyses. We suggest that practical use of RWL should necessarily 
be accompanied by variable selection. In Section 4, we introduce the concepts of predic¬ 
tive and prescriptive covariates. It has been noticed in practical settings that the main 
effects, rather than treatment-covariate interactions, tend to explain most of the variability 
in the outcome, and thus they are important for good prediction (Gunter et al. 2011). It 
is not uncommon that interactions between prescriptive covariates and treatment are over¬ 
looked due to their small predictive ability. Hence variable selection techniques designed for 
prediction applications might not work perfectly in finding the prescriptive covariates. In 
contrast, RWL focuses on treatment-covariate interactions, and variable selection in RWL 
only targets the prescriptive covariates. 

RWL methods with linear and Gaussian RBF kernels were discussed in this article. The 
decision rule from linear RWL is easy to interpret, but at the risk of misspecification. As 
a universal kernel, the Gausian RBF kernel is free of model misspecification. An obvious 
downside is that the final decision rule may be difficult to interpret, and clinical practitioners 
may not be comfortable using “black box” decision rules. One compromise is to apply linear 
RWL with a rich set of bases, for example, including two-way and three-way covariate 
interactions. 

When the endpoint outcome is the survival time, observations are commonly subject to 
right censoring because of subject dropout or censoring due to the end of study. Currently, 
RWL does not cover this type of outcomes in theory due to the censorship. We recommend 
using martingale residuals (Fleming and Harrington 1991) as alternative outcomes. Specif¬ 
ically, residuals in RWL could be opposites of martingale residuals that are estimated by a 
Cox proportional hazards model on clinical covariates but excluding treatment assignment. 
Actually, the idea is not new. Therneau et al. (1990) suggested that martingale residuals 
from a null Cox model could be used as outcomes for Classification and Regression Trees 
(CART). The rigourous theoretical justification is under development. Here we provide an 
intuitive explanation. Let T denote the survival time, and C denote the censoring time. 
The observed data quadruplet is ( X,A,R = T A C, A = I(T < C)). Firstly, for general 
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survival analysis, ( R , A) is seen as the censored outcome. We view the survival data from a 
different perspective. We treat A as the outcome, and (X,R) as clinical covariates. In the 
framework of counting processes, the martingale residual can be explained as the difference 
between the observed number of failures (0 or 1) during the time in the study for a subject 
and the expected numbers based on the fitted model. Thus it is almost in line with our 
derivation of RWL, although R may not be appropriate as a clinical covariate. Secondly, 
martingale residuals from a Cox model fitted excluding a covariate of interest can be used 
to determine the appropriate functional form of the covariate (Therneau et al. 1990). If we 
estimate martingale residuals by ignoring treatment assignment, they contain information 
related to the heterogeneity of treatment effects. Thirdly, martingale residuals have some 
properties reminiscent of linear models, for example, the residuals are asymptotically un¬ 
correlated. Moreover, the weighted sum of martingale residuals is zero, which is essential 
for RWL to control the treatment matching factor. 

Several extensions are worthy of further investigation. In this article, we only consid¬ 
ered two-arm trials. In practice, some trials involve multiple arms, i.e. have three or more 
treatment arms. A recent review by Baron et al. (2013) reported that 17.6% of published 
randomized clinical trials in 2009 were multiple-arm. Because of the number of possible 
comparisons, it is more complex to find ITRs in such trials compared with two-arm trials. 
It would thus be worthwhile to extend RWL to multiple-arm trials. In the machine learning 
literature, two schools of approaches are generally suggested for multi-category classifica¬ 
tion. One is to directly take all classes into consideration. Multicategory SVMs (Lee et al. 
2004) and penalized logistic regression (Zhu and Hastie 2004) are two examples. The other 
school is to construct and combine several binary classifiers. One-versus-one, one-versus-all, 
and more general error-correcting output codes (Dietterich and Bakiri 1995, ECOC) are 
in this school. Similar generalizations may be possible for finding ITRs for multiple-arm 
trials. 

For chronic diseases, multi-stage dynamic treatment regimes are more useful than single- 
stage optimal treatment regimes. In the context of multi-stage decision problems, a dynamic 
treatment regime is a sequence of decision rules, one per stage of intervention, for adapting 
a treatment plan over time to an individual. Zhao et al. (2014) has extended OWL to 
dynamic treatment regimes. Considering the improved performance of RWL over OWL, it 
is of interest to extend RWL to dynamic treatment regimes. 

The intended use of the estimated optimal ITRs discovered using the proposed approach 
is for treating new patients. The theoretical, simulation and data analysis results demon¬ 
strate that such estimated ITRs are likely to lead to improved clinical outcomes for these 
patients. However, since the data used for estimating the ITRs should not also be used 
for confirmation, it is expected that an additional randomized clinical trial comparing the 
estimated ITR with standard of care, or some other suitable treatment comparison, will be 
used for confirmation as is typically done for new candidate treatments (Zhao et al. 2011). 

APPENDIX A. Proofs 

Proof of Lemma 2.1 
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Proof. Note that 


E 


'l(A?d(X)) 


tt(A,X) 


X = E (I (d(X) ± 1) \X, A = 1) + E (l(d(X) / -1) | X,A = -l) = 1. 


The desired result follows easily. 
Proof of Theorem 2.2 

Proof. For any measurable function g 
'R-g(X) 


Var 

= Var 


n(A,X) 

R-g(X ), 

7 r(A,X 

R-g(x) 


I {A + d(X)) 


I (A + d(X))) + Var ( 9{X) + d(X)) 


V AA,X) 

+2C ° V W I(A * d(X)) '>* « X » 

It suffices to show that the covariance term is zero. Applying (21), we have 

'R-g(X) 


E 

= E 
= 0. 


7 t(A,X 
R 


X 


’-I(A + d(X)) 


l(A?d(X)) X)-~g(X )E 


'l{A?d(X)) 


vr (A,X) 


X 


Note that 


~g{X) = E 


R 


-I{A / d(X)) 


X 


vr (A,X) 

= E(R\X, A = 1)1 (d(X) + 1) + E(R\X,A = -l)l(d(X) + -l). 


( 21 ) 

□ 


( 22 ) 


Thus we have, 


E 


R-g(X) 


l{A^d(X)) 


X 


7T(A,X)2 

= E(R-g(X)\X,A = 1)1 (d(X) + 1)/tt(1,X) 

+E (R - g(X)\X, A = -1) l(d(X) + -l)/vr(-l, X) 
= 0 . 


The desired result follows easily. □ 

Proof of Theorem 3.1 

Proof. Given X = x, for any measurable function /, similar reasoning to that used in the 
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proof of Lemma 2.1 yields, 


E 


( T{Af(X )) 
^ vr (A,X) 



= 2 . 


Then the conditional T-risk is 

= E(R\X = x,A = l)T(f(x)) + E(R\X = x,A = -1 )T(-/(*)) - 2s(*) 

= (E(T|X = *, A = 1) - E(T|X = tc, A = -1)) T(f(x)) 

+2E(i?|X = x,A = —1) — 2g(x). (23) 

If E [R\X = x,A = 1] — E [7?,| JC = x,A = —1] > 0, any function f(x) > 1 minimizes the 
conditional T-risk; similarly, if E [i ?|X = x,A = 1]—E [R\X = x, A = —1] < 0, any function 
f(x ) < —1 minimizes the conditional T-risk. For either case, sign(ff ) = d*. 

For the second part, by applying (23), 

E {l^xl nAd ' [Xm = ' E {^$Q T{ArTjX))]X = x ) 

= (E(T|X = x,A = 1) -E(R\X = x,A = -1)) (T{d*(x)) - T(ff g (x))) = 0. 

The desired result follows by taking expectations on both sides. □ 

Proof of Theorem 3.2 

Proof. Given X = x. By applying (23), for any measurable function /, we have 

e =*) - e (ifM T{Af ^ {x))]x=x ) 

= (E(R\X = x,A = l)~ E(R\X = x,A = -1)) (T(f(x)) - T(f^ g (x))) . 

Similarly, 

E (^fxj I(i + sign(f(X)))\X = *) - E + d*(X))\X = x^j 

= (E(R\X = x,A = 1) -E(T|X = x,A = -1)) (I (sign(f(x)) / 1) - I(d*(x) ± 1)). 

From the proof of Theorem 3.1, when E(T|X = x,A = 1) > E(T|X = x, A = —1), 
ff g (x) > 1 and d*{x) = 1, so T(ff g (x)) = 0 and I(d*(®) ^ 1) = 0; when E(T|X = x,A = 
1)’< E(R\X = x,A = -1), f* g {x)< -1 and d*{x) = -1, so T(ff g (x)) = 2 and I (d*{x) ± 
1) = 1. Note that, for any measurable function /, 1 > T(f(x )) — I (sign(f (x)) / 1) > 0. 
Thus it is easy to check that when E(T|X = x,A= 1) > E(T|X = x,A = —1), 

T(f(x)) - T{fr tg [x)) > I (sign(f(x)) ± 1) -1 (d*(x) + 1), 

and when E(T|X = x, A = 1) < E(T|X = x, A = —1), 

T(f(x)) - T(fr >g (x)) < I (sign(f(x)) ± 1) -1 [d\x) + 1). 
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So, for either case, we have 


E (mXx) 104 ^ S ^(/W))l* = - E + d*(X))\X = x 

s E = *) ‘ E {l ^§ TW ^ x ^ x =») ■ 

The desired result follows by taking expectations on both sides. □ 

Proof of Theorem 3.3 

Proof. Let L(h,b) = (R — g(X))T(A(h(X) + b))/Tr(A,X). For simplicity, we denote fD n ,x n , 
hD n ,\ n and b Dni \ n by f n , h n and b n , respectively. By the definition of h Dn ,\ n and b Dnt \ n , 
we have, for any h € Hr and b G R, 

P n (L(h n ,b n )) < P n(L(h n ,b n )) + Y\\hn\\ 2 K < P n(L(h,b)) + y||/l || 2 K , 

where P n denotes the empirical measure of the observed data. Then, lim sup n P n (L(h n , b n )) < 
P (L(h, b )) = TZr,g{h + b ) with probability 1. This implies 

lim sup F n (L(h n , b n )) < inf 7 Z T , g (h + b) < P (L(h n ,b n )) 

n heH K ,be«. 

with probability 1. It suffices to show P n (L(h n ,b n )) — P (L(h n ,b n )) —> 0 in probability. 

We first obtain a bound for ||/i n ||^. Since P n (L(h n ,b n )) + A n ||/i n ||^/2 < P n (L(h,b)) T 
X n \\h\\ 2 K /2, for any h G PLr and b € M, we can choose h = 0 and b = 0 to obtain, 
P n (L(h n ,b n )) + A n ||/i n ||^/2 < P n ((i2 — g(X))/ir(A, X)). Note that 0 < T(u) < 2. We thus 
have, 

AnllM^ < 2P n (|i? — g(X)\/ir(A,X)) < 2M 0 . 

Let M\ = sJ2Mq. Then the Pi k norm of \f\fhn is bounded by M\. 

Next we obtain a bound for b n . We claim that there is a global solution ( h n ,b n ) such 
that h n {xi) + b n € [— 1 , 1] for some i. Suppose there is a global solution (h' n , b' n ) such that 
\h' n {xi) + b' n \ > 1 for all i. Let 5 = \h' n (xi 0 ) + b' n \ = mini<,< n \h' n (xi) + b ' n | > 1. Then let 
h n = h' n and b n = b' n — {5 — 1 )sign(h' n (xi 0 ) + b' n ). It is easy to check that h n (xi 0 ) + b n = 1 
if h' n (xi 0 ) + b' n > 1, and h n (xi 0 ) + b n = — 1 if h' n {xi Q ) + b' n < — 1; furthermore when i ^ io, 
h n {xi) + b n > 1 if h' n {xi) + b' n > 1, and h n (xi) + b n < -1 if h' n [xi) + b' n < - 1 . So 
T(h n {xi) + b n ) = T{h! n {xi) + b' n ) for all i. Hence ( h n ,b n ) is a global solution and satisfies 
our claim. Now if a solution ( h n ,b n ) satisfies our claim, we then have, 

\bn\ T IT l^ , n(®io)| < 1 + 11 h n | |oo- 

Note that ||/i||oo < Ca'IWIat- We have, 

I y/ A/i b n T \ n -\- Cr\JX n 11 h n \\ r. 

Since A n —> 0, and Cr and \/A^||/i n ||x are both bounded, we have \y/\fb n \ is bounded too. 
Let the bound be M 2 , i.e. \y/\nb n \ < M 2 . 

Note that the class {\f\fh : ||\/A^/i||a: < M{\ is a Donsker class. So {y/Xf(h + b) : 
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|| \fKih\\K < Mi, |\/A^ 6 | < M 2 } is also P-Donsker. Consider the function 

u < — a/A, 

— VA < u < 0 , 

0 < u < V% 
u > V\. 

We have T\(\/Xu) = s/\T(u). Since T\(u) is a Lipschitz continuous function with Lips- 
chitz constant equal to 2 , and bounded, the class {y/X^L(h,b) : ||\/A^ 7 i||^ < 

Mi, \ \f\nb\ < M 2 } is also P-Donsker. Therefore, 

V / nA^(P n - F)L(h n , b n ) = O p ( 1). 

Consequently, from n\ n —>• 00 , P n (L(/i n , b n )) — P (L(h n ,b n )) —> 0 in probability. □ 

Proof of Lemma 3.4 

Proof. Fix any 0 < e < 1. d*(x) = sign(E(fi|W = x,A = 1) — E(f?|X = x, A = — 1)) 
is measurable. Since /r is regular, using Lusin’s theorem in measure theory, we know that 
d*(x) can be approximated by a continuous function f'{x) € C(X) such that fi(f'(x) 7 ^ 
d*(x))<^. Thus 


T x (u) = 


2\/A 

2\/A —-ty(\/A + 
^-“) 2 


if 

if 

if 

if 


Then, 


E {^r T{Af,{xmx = x ) - E (^^r T(Ar(x))|x =* 

(E(f?|X = x, A = 1) - E(i2|X = x,A = - 1 )) (T(f'(x)) - T(<f (*))) . 


KrM') - n T,g = I KrM') - KT,g(d*)\ 


j (E(R\X = x,A = 1) -E(R\X = x,A 


-1)) (T(f\x))-T{d*{x)))ii{dx) 


< 


J (E(R\X = x,A = l)-E(R\X = x,A 
!(/'(*) / d*(x))n(dx). 


-1)) T(f\x))-T(d*(x)) 


Since \R\ < M and 0 < T(u) < 2, 

n T , g (f')-K* Tt9 <e 

Since K is universal, there exist a function f" € Hk such that || f" — f \\00 < 4 ^. Note 
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that T(-) is Lipschitz continuous with Lipschitz constant 2. Similarly, 


\n T M")-K T M')\ 

j (E(i?|X = x,A= 1) - E(R\X = x, A = -1)) (T(f"(x)) - T(f(x)))ii(dx) 


< 2 


{E(R\X = cc, A = 1) - E(R\X = x,A = -l)) f'{x) - f'{x) dx ) < e. 


By combining the two inequalities, we have 

n Tig {f")-n^ g <2e. 

Noting that f" £ and letting e —> 0, we obtain the desired result. □ 

Proof of Theorem 3.6 

Proof. The proof follows the idea in Devroye et al. (1996, Theorem 7.2). Since the proof is 
very similar, we only provide a sketch to save space. 

Let b = 0 . 6162^3 • • • be a real number on [0,1] with the given binary expansion, and let B 
be a random variable uniformly distributed on [0,1] with expansion B = O.B 1 B 2 B 3 • • •. Let 
us restrict ourselves to a random variable X with the support {* 1 , X 2 , • • • } where Xi € X. 
For simplicity, we recode the support of X as {1,2, • • • }. Let 

P(X = i)=pi, i> 1, (24) 

where p\ > P 2 > • • • > 0, and Yli^n+iPi — m ax(8c n , 32np n+ i) for every n. Such pf s 
exist by Devroye et al. (1996, Lemma 7.1). Let A £ {1,-1} be a binomial variable with 
ir(A, X) = 0.5. For a given b, set R = AM if bx = 1, and R = —AM if bx = 0. Then the 
Bayes rule is d*(X) = (2 * bx — 1)- Thus each b £ [0,1] describes a different distribution of 
( X,A,R ). Introduce the shortened notation D n = {(Xi, A\, R±), ■ ■ ■ , (X n , A n , R n )}. Let 
d n be a rule generated by data D n . Define d l n = d n (i) for i = 1, • ■ ■ , n. Let A7 Z n (b) be the 
excess risk of the rule d n for the distribution parametrized by b, and AT Z n (B) be the excess 
risk of the rule d n for the random distribution. 


= E 
= E 


A K n (B) 
R 


t:(A,X 


-I(j4 ^ dn(X)) 


B 


-E 


R 


-1(A ± d\X)) 


B 


7 t(A,X 

(E(R\B, X,A = 1)~ E(R\B, X,A = -1)) (l (d n (X) + 1) - I (d*{X) / 1)) | B 
= 2ME(l(d n (X) / d*(X))\B) 

= 2ME(l(d n (X)^2B x -l)). 


Let L n (B) = E(l(d n (X) / 2 Bx — !))• Then we have, 


L n {B) = Y J PiK<^2B i -l). 

i=l 


40 


















Following the same arguments used in Devroye et al. (1996, Theorem 7.2), we have 


P(L n (B) < 2c n \D n ) < P( piBi < 2 c n ) < 


—2 n 


i=n-\-l 


Hence we have 


sup mt ih - 

b n \ 2c n 


> EE inf 




n V 2 Cn 
oo 


Xi,X 2 ,--. 


> E 1-J] E(P(L n (5) < 2c n |D n )|X 1 , X 2 , • • •) 


2=1 


> l-X]e- 2ri = > r- 

2=1 


e 2 - 1 2 


Here we are omitting many steps. Refer to Devroye et al. (1996, Theorem 7.2) for details. 
The conclusion is that there exists a b for which AlZ n (b) > 2 Mc n , n = 1, 2, • • • . □ 

Proof of Theorem 3.7 

Proof. Define the random variable S = • We consider a probability measure on the 

triplet (X, A, S ) instead of on (X, A, R). Let D n = {X % . Ai, 5 /}” =1 be independent random 
variables with the same distribution as (X, A, S). Let P n be the empirical measure on D n . 
For simplicity, we denote f Dn ,\ n , h Dnj \ n and b Dni \ n by f n , h n and b n , respectively. Let 
(h \ n , b \ n ) be a solution of the following optimization problem: 

Let L(h, b) = ST{A(h{X) + b)). Then 


KT, a (fn)-KTMT, 9 ) 

< (P - P n)L(h n , b n ) + ( y||h n | | 2 + P n L(h n , b n )) - (■y ||h A „|| 2 + F n L(h Xn , 6 A J) 

+ (P n — P)L(h / \„, &An) + Al(An) 

< (P — P n )L(/i n , 6 n ) + (P n — ¥)L(h\ n , b\ n )) + *4(A n ). 

We first estimate the second term by the Hoeffding inequality (Steinwart and Christmann 
2008, Theorem 6.10). Since \L(h,b)\ < 2Mq, we thus have, with probability at least 1 — 5/2, 


(Pn - P )L(h K ,b Xn ) < M o y (25) 

By the arguments used in the proof of Theorem 3.3, we have ||/i n ||ir < and 

\b n \ < 1 + Ck^- Then let T= {(M) efexl : \\h\\ K < yj^,\b\ < l + &yf } . 
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Let L(h, b) 


S 


T{A{h{X) + b))- 1 


For the first term, (P — P n )L(h n , b n ), 


(P-P n )L(h n ,b n ) < sup (P-P n )L(h,b) 

(h,b)&T 

= sup (P - F n )L(h, b) + (P - P„)L(0,0). 

{h,b)eT 


When an (cCj,aj,Sj) triplet changes, the random variable sup(/j fe ) g j-(P — P n )L(h,b) can 
change by no more than ^Mi. McDiarmid’s inequality (Bartlett and Mendelson 2002, 
Theorem 9) then implies that with probability at least 1 — 5/4, 


sup (P — F n )L(h, b) < E sup (P — ¥ n )L(h, b) + Mq 
( h,b)£T (h,b)&F 

A similar argument, together with the fact that EP n L(0, 0) = PL(0,0), shows that with 
probability at least 1 — 5/2, 


21og(4/5) 


n 


(P - P n )L(h n ,b n ) < E sup (P — ¥ n )L(h, b) + 2M 0 

{h,b)&T 

Let D' n = {Xj, A!-. S(}” =1 be an independent sample with the same distribution as (X, A, S ). 
Let P^ denote the empirical measure on D' n . Let a be a uniform {±l}-valued random 
variable, and a±,... ,a n be n independent copies of a. Then we have 


2 log (4/5) 


n 


E sup (P — P n )L(h,b) = E sup E( P' n L(/i, b) — P n L(h, b) 
(h,b)eT (h,b)&F 


D r 


< 2E sup ¥ n aL(h, b ) 

{h,b)£T 

< 2EE^ sup \P n aL(h,b)\ 

' {h,b)£T 


S i , - 4 .^, X 1 , • • « 


n 


< 1M/ ° E sup [ V ai(h(Xj) + b ) 

n (h,b)£F 1 ^ 


The last inequality is due to the contraction inequality (Ledoux and Talagrand 1991, Corol¬ 
lary 3.17). The preceding can be further majorized by using Lemma 22 in Bartlett and 
Mendelson (2002), 


E sup (P — P n )L(h,b) < 

(h,b)GT 

< 


16 M 0 


E 


n 


sup 


(h,b)eT 1 i=1 




+ + C K 


n 


2 Mq 


HE 


Oi 


i= 1 


16Mp / 2 Mq ^ 
y/n \j \ n 


16 M 0 
y/n 


(1 + C K 



16M 0 

y/n 


(1 + 2 Ck 
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Then we have that with probability at least 1 — 5/2 , 


-P n)L(h n ,b n ) < ^^(1 + 2 C K 


n 


2 Mfl 


A, 


2 Mq 


2 log (4/<5) 


n 


By the assumption, (25), and (26), we obtain that with probability at least 1 — <5, 
1^T,g(fn) — ^T.g — Mq\ - ——~ 4- 7 ~(1 + 2C/f \/ ) + 2Mq 


n 


n 


2 log (4/(5) 


(26) 


+ cA£. 


n 


Let A n = n 2 ' 3+1 . By Theorem 3.2, we obtain the final result that with probability at least 

i-s, 

H(sign(f n )) - TV < cv / log(4/<5)n _2 TFT. 

Here c = Mo (16 + 3\[2 + 32CkV^^o) + c - This completes the proof. □ 

Proof of Lemma 3.9 


Proof. We first introduce a lemma. It is revised from Lemma 4.1 in Steinwart and Scovel 
(2007) to adapt to our settings for individualized treatment rules. 

Lemma A.l. Let X be the closed unit ball of the Euclidean space and P be a distri¬ 
bution on X x A x Xi with regular marginal distribution on X. Recall 5(x) = E(i?|Af = 
x, A = 1) — E(i?|AC = x, A = —1) for x € X. On X := 3X we define 


5{x) 



if ||*|| < 1, 
otherwise, 


where || • || is the Euclidean norm. We also write X + = {* € X : <5(*) > 0}, and 
X~ = {x £ X : S(x) < 0}. Finally let B(x,r) denote the open ball of radius r about x in 
M p . Then for x £ X + , we have B(x,t x ) C X + , and for x £ X~ , we have B(x,t x ) C X~ . 

The proof is simple, and is the same as that of Lemma 4.1 in Steinwart and Scovel 
(2007). We omit the proof here. In the lemma, the support is enlarged to ensure that all 
balls of the form B(x, t x ) are contained in the enlarged support. We return to the proof of 
Lemma 3.9. 

Let L 2 (P p ) be the L 2 -space on M p with respect to Lebesque measure, and % (T (M P ) be 
the RKHS of the Gaussian RBF kernel K a . The linear operator V a : L 2 (MP) -£ TL^iW 1 ) 
defined by 


VJ(x) 


(2a) d / 2 f 

vr d / 4 Jmp 


0 -2a 2 \\x-y\\ 


'Z(y)dy, 


i £ L 2 (M p ), * € M p , 


is an isometric isomorphism (Steinwart et al. 2006). Thus we have, 


A(A) < inf 


A, 


e£L 2 (Rp) 2 


1 l 2{ RP)+^T, 9 (y^)-n, g - 


(27) 
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With the notation of Lemma A.l we fix a measurable fp:X—>[— 1,1] that satisfies fp = 1 
on * + , fp = -i on X , and fp = 0 otherwise. For l := (a 2 /tt) p ^ 4 fp, we immediately 


obtain, 



(28) 


where 6 {p) denotes the volume of X. As shown in the proof of Theorem 3.2, we have 


n T ,g(VJ) - R* Tg = E(|«S(®)| • \T(VJ(x)) - T(d*(x))\) < 2E(|<J(*)| • \VJ(x) - d*(x)\). 

Following the same derivations as in the proof of Theorem 2.7 of Steinwart and Scovel 
(2007), we also obtain 

\V a Z{x)-d*{x )| < 8 e- CT2r - /(2p) . 

The geometric noise assumption yields 

n T ,g(VJ) - < 16E(|,5(*)|e- ,T2T - /(2p) ) < l6C(2p) qp / 2 a~ qp . (29) 

Combining (27), (28) and (29) yields 

/oi 2 \ P/2 

A(X) < ( J 9 2 {p) A/2 + lQC{2p) qp / 2 a~ qp . 

1 

The desired result now follows by taking a = A . □ 
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